Methodology

Learn about ESL-Bench's five-dimension evaluation system, scoring protocol, and agent integration

Scoring System

Total Score = Σ(D_i × W_i)
W = [0.20, 0.20, 0.20, 0.20, 0.20]

Unified Two-Stage Scoring

All dimensions use a unified two-stage scoring protocol: programmatic checks first (JSON schema parsing + ground truth comparison with numerical tolerance), then LLM rubric evaluation for answer quality.
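The two-stage protocol can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the `"answer"` field name, the tolerance value, and the rubric callable are all assumptions.

```python
import json
import math

def programmatic_check(response: str, ground_truth: float, tol: float = 1e-2) -> bool:
    """Stage 1: parse the agent's JSON response and compare its numeric
    answer against the ground truth within a tolerance.
    (Sketch: the 'answer' field and tolerance are assumptions.)"""
    try:
        value = float(json.loads(response)["answer"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return False
    return math.isclose(value, ground_truth, abs_tol=tol)

def score(response: str, ground_truth: float, llm_rubric) -> float:
    """Stage 2 fallback: if the programmatic check fails, defer to an
    LLM rubric grader (a callable returning a score in [0, 1])."""
    if programmatic_check(response, ground_truth):
        return 1.0
    return llm_rubric(response)
```

A response that parses and matches within tolerance earns full credit deterministically; only the remainder is routed to the LLM rubric for quality grading.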

Weighted Aggregation

All dimensions carry equal weight (0.20 each) and Total = Σ(D_i × W_i), so each evaluation dimension contributes equally to the final score.
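The aggregation reduces to a weighted sum over the five dimension scores, sketched below (dimension names are from this page; the dict-based interface is an assumption):

```python
DIMENSIONS = ["Lookup", "Trend", "Comparison", "Anomaly", "Explanation"]
WEIGHTS = dict.fromkeys(DIMENSIONS, 0.20)  # equal weights, summing to 1.0

def total_score(dimension_scores: dict) -> float:
    """Total = sum of D_i * W_i over the five dimensions."""
    return sum(dimension_scores[d] * WEIGHTS[d] for d in DIMENSIONS)
```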

Deterministic Ground Truth

All standard answers are programmatically computed from exported structured data (Profile, Device Stream, Exam Visits, Event Log), requiring no external clinical knowledge, ensuring verifiability.

Version Consistency

Evaluation results within the same dataset version are directly comparable. Major monthly updates automatically trigger re-evaluation of all agents.

Five Evaluation Dimensions

Lookup

Direct data retrieval

Query user profile attributes, device indicator values on specific dates, exam results, and event properties. For example: 'What was this user's resting heart rate on 2024-03-15?'. This dimension tests the agent's basic retrieval capability on structured health data and includes adversarial indicator pairs to prevent shortcut answers.
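A Lookup query amounts to an exact match over the exported device stream. The record layout below (date / indicator / value keys) is an assumption for illustration; note the adversarially similar indicator name that a sloppy string match could confuse.

```python
# Hypothetical exported device-stream records (layout is an assumption).
device_stream = [
    {"date": "2024-03-14", "indicator": "resting_heart_rate", "value": 63},
    {"date": "2024-03-15", "indicator": "resting_heart_rate", "value": 61},
    {"date": "2024-03-15", "indicator": "heart_rate_max", "value": 142},  # adversarial pair
]

def lookup(indicator: str, day: str):
    """Return one indicator's value on one date, or None if absent."""
    for rec in device_stream:
        if rec["indicator"] == indicator and rec["date"] == day:
            return rec["value"]
    return None
```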

Trend

Temporal trend analysis

Monthly aggregation, rate-of-change computation, consecutive trend identification, and regime change detection. For example: 'In which month did resting heart rate show the largest month-over-month change?'. This dimension requires agents to maintain full temporal context; retrieval-based methods perform poorly because the timeline is fragmented across chunks.
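The example question above can be answered deterministically from daily data: bucket by month, take means, and find the largest absolute month-over-month delta. A minimal sketch, assuming a `{'YYYY-MM-DD': value}` input mapping:

```python
from collections import defaultdict

def largest_mom_change(daily: dict) -> str:
    """Return the 'YYYY-MM' month whose mean value shows the largest
    absolute change versus the preceding month."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        buckets[day[:7]].append(value)          # group by 'YYYY-MM'
    months = sorted(buckets)
    means = [sum(buckets[m]) / len(buckets[m]) for m in months]
    deltas = {months[i]: abs(means[i] - means[i - 1])
              for i in range(1, len(months))}
    return max(deltas, key=deltas.get)
```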

Comparison

Cross-event or cross-source comparison

Compute pre/post event indicator deltas, compare shared indicators across events, and rank severity. For example: 'How did mean step count change in the 14 days after the user started a jogging routine, compared with the 14 days before?'. This dimension requires cross-event joins and precise temporal window alignment; text retrieval methods fall behind because the evidence is scattered.
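The pre/post comparison reduces to two window means around the event date. A sketch under assumed inputs (a `{date: value}` mapping and a known event date):

```python
from datetime import date, timedelta

def pre_post_delta(daily: dict, event_day: date, window: int = 14) -> float:
    """Mean over [event_day, event_day + window) minus the mean over
    [event_day - window, event_day)."""
    def mean_in(start, end):
        vals = [v for d, v in daily.items() if start <= d < end]
        return sum(vals) / len(vals)
    pre = mean_in(event_day - timedelta(days=window), event_day)
    post = mean_in(event_day, event_day + timedelta(days=window))
    return post - pre
```

Getting the half-open windows right matters: the event day itself belongs to the post window, and off-by-one boundaries are exactly where retrieval-based agents tend to lose points.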

Anomaly

Anomaly detection and tracking

Check whether indicators exceed thresholds, count abnormal consecutive days, and identify multi-indicator abnormal clusters. For example: 'Has fasting glucose ever been marked abnormal? Which exams had both fasting glucose and HbA1c marked abnormal?'. This dimension requires multi-condition filtering and cross-exam comparison.
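Both parts of the example question have deterministic answers over the exported records. A sketch, assuming day-ordered abnormal flags and an exam layout with an `abnormal` set of indicator names (both layouts are assumptions):

```python
def max_consecutive_abnormal(flags) -> int:
    """Longest run of True in a day-ordered abnormal-flag sequence."""
    best = run = 0
    for f in flags:
        run = run + 1 if f else 0
        best = max(best, run)
    return best

def exams_with_both_abnormal(exams, a: str, b: str):
    """Exam visits where indicators a and b are both flagged abnormal."""
    return [e["id"] for e in exams if a in e["abnormal"] and b in e["abnormal"]]
```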

Explanation

Causal attribution and evidence organization

Identify contributing events from indicator baseline deviations, rank them by impact magnitude, and organize complete evidence chains. For example: 'This user's fasting blood glucose decreased from baseline — identify contributing events, rank them by contribution, and organize supporting evidence'. This is the most challenging dimension, requiring mechanism recovery and evidence-ranking capabilities.
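Because every event in ESL-Bench carries explicit per-indicator impact parameters, the ground-truth ranking is a straightforward sort by impact magnitude. The event-log field names below (`name`, `impacts`) are assumptions for illustration:

```python
def rank_contributing_events(event_log, indicator: str):
    """Return (event, impact) pairs for events affecting the given
    indicator, ranked by absolute impact magnitude, descending."""
    hits = [(e["name"], e["impacts"][indicator])
            for e in event_log if indicator in e["impacts"]]
    return sorted(hits, key=lambda x: abs(x[1]), reverse=True)
```

The agent, of course, never sees the impact parameters; it must recover this ranking from the observable trajectory, which is what makes the dimension hard.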

Paper & Citation

ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agent

Evaluating longitudinal health agents requires benchmarks with multi-source trajectories and deterministic ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark that provides 20 synthetic users, each with a 1–5 year health trajectory comprising a structured profile, daily device measurements, periodic exam records, and an event log carrying explicit per-indicator impact parameters. Each user is paired with 100 evaluation queries organized along five dimensions — Lookup, Trend, Comparison, Anomaly, and Explanation — and stratified into Easy, Medium, and Hard tiers. Because every event–indicator relationship is recorded with explicit temporal parameters, all ground-truth answers are programmatically computable.

BibTeX Citation
@article{eslbench2026,
  title={ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agent},
  author={Shanda Group},
  journal={arXiv preprint arXiv:2604.02834},
  year={2026}
}