Methodology
Learn about ESL-Bench's five-dimension evaluation system, scoring protocol, and agent integration
Scoring System
Unified Two-Stage Scoring
All dimensions use a unified two-stage scoring protocol: programmatic checks first (JSON schema parsing + ground truth comparison with numerical tolerance), then LLM rubric evaluation for answer quality.
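As a rough sketch, the two stages might look like the following. The `answer` field name, the tolerance value, and the `rubric_judge` callable are illustrative assumptions, not part of the benchmark specification:

```python
import json
import math

# Hypothetical tolerance; the benchmark's actual numerical tolerance is not specified here.
NUMERIC_TOLERANCE = 1e-2

def programmatic_check(raw_answer, ground_truth):
    """Stage 1: parse the agent's JSON answer and compare against ground truth."""
    try:
        parsed = json.loads(raw_answer)
    except json.JSONDecodeError:
        return None  # unparseable: defer to the LLM rubric stage
    value = parsed.get("answer") if isinstance(parsed, dict) else parsed
    if isinstance(value, (int, float)) and isinstance(ground_truth, (int, float)):
        return math.isclose(value, ground_truth, abs_tol=NUMERIC_TOLERANCE)
    return value == ground_truth

def score(raw_answer, ground_truth, rubric_judge):
    """Unified two-stage protocol: programmatic check first, LLM rubric fallback."""
    verdict = programmatic_check(raw_answer, ground_truth)
    if verdict is not None:
        return 1.0 if verdict else 0.0
    return rubric_judge(raw_answer)  # stage 2: rubric-based quality score in [0, 1]
```

The key design point is that the deterministic check short-circuits the (more expensive, less reproducible) LLM judge whenever a structured answer can be verified directly.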
Weighted Aggregation
All dimensions carry equal weight (0.20 each): Total = Σ(D_i × W_i), so each evaluation dimension contributes equally to the final score.
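A minimal illustration of the aggregation rule (dimension names are from this page; the code itself is a sketch, not from the benchmark):

```python
# Equal weights across the five dimensions (0.20 each), per the aggregation rule above.
WEIGHTS = {
    "Lookup": 0.20,
    "Trend": 0.20,
    "Comparison": 0.20,
    "Anomaly": 0.20,
    "Explanation": 0.20,
}

def total_score(dimension_scores):
    """Total = sum of D_i * W_i over the five evaluation dimensions."""
    return sum(dimension_scores[dim] * w for dim, w in WEIGHTS.items())
```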
Deterministic Ground Truth
All standard answers are programmatically computed from exported structured data (Profile, Device Stream, Exam Visits, Event Log), requiring no external clinical knowledge and ensuring verifiability.
Version Consistency
Evaluation results within the same dataset version are directly comparable. Major monthly updates automatically trigger re-evaluation of all agents.
Five Evaluation Dimensions
Lookup
Direct data retrieval
Queries user profile attributes, device indicator values on specific dates, exam results, and event properties. For example, 'What was this user's resting heart rate on 2024-03-15?'. Tests the agent's basic retrieval capability on structured health data, including adversarial indicator pairs to prevent shortcuts.
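A Lookup query of this kind reduces to a direct keyed retrieval. A toy sketch, with a hypothetical in-memory device stream (the actual export format is not shown here):

```python
from datetime import date

# Hypothetical exported device stream: indicator name -> {date: value}.
device_stream = {
    "resting_heart_rate": {
        date(2024, 3, 14): 63,
        date(2024, 3, 15): 61,
    },
}

def lookup_indicator(indicator, day):
    """Direct retrieval of one device indicator's value on a specific date."""
    return device_stream.get(indicator, {}).get(day)
```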
Trend
Temporal trend analysis
Covers monthly aggregation, rate-of-change computation, consecutive-trend identification, and regime-change detection. For example, 'In which month did resting heart rate show the largest month-over-month change?'. Requires agents to maintain full temporal context; retrieval-based methods perform poorly due to timeline fragmentation.
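The month-over-month example could be answered as follows, assuming per-month means have already been aggregated (a simplified sketch, not benchmark code):

```python
def largest_mom_change(monthly_means):
    """Return the month whose mean shows the largest absolute
    month-over-month change versus the previous month."""
    months = sorted(monthly_means)  # "YYYY-MM" keys sort chronologically
    deltas = {
        curr: abs(monthly_means[curr] - monthly_means[prev])
        for prev, curr in zip(months, months[1:])
    }
    return max(deltas, key=deltas.get)
```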
Comparison
Cross-event or cross-source comparison
Computes pre/post event indicator deltas, compares shared indicators across events, and ranks severity. For example, 'How did mean step count change in the 14 days after the user started a jogging routine, compared with the 14 days before?'. Requires cross-event joins and precise temporal-window alignment; text-retrieval methods fall behind because the evidence is scattered.
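The pre/post window comparison in the example can be sketched as follows, assuming a mapping from dates to daily values; the 14-day default mirrors the example query:

```python
from datetime import date, timedelta

def pre_post_delta(daily_values, event_day, window=14):
    """Change in an indicator's mean over `window` days after an event
    versus `window` days before it (post mean minus pre mean)."""
    def window_mean(sign):
        vals = [daily_values[d]
                for i in range(1, window + 1)
                if (d := event_day + sign * timedelta(days=i)) in daily_values]
        return sum(vals) / len(vals) if vals else None

    pre, post = window_mean(-1), window_mean(+1)
    if pre is None or post is None:
        return None  # insufficient data in one of the windows
    return post - pre
```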
Anomaly
Anomaly detection and tracking
Checks whether indicators exceed thresholds, counts consecutive abnormal days, and identifies multi-indicator abnormal clusters. For example, 'Has fasting glucose ever been marked abnormal? Which exams had both fasting glucose and HbA1c marked abnormal?'. Requires multi-condition filtering and cross-exam comparison.
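One of the sub-tasks above, counting the longest streak of abnormal days, can be sketched as:

```python
def max_consecutive_abnormal(daily_flags):
    """Length of the longest run of consecutive abnormal (True) daily flags."""
    best = run = 0
    for abnormal in daily_flags:
        run = run + 1 if abnormal else 0
        best = max(best, run)
    return best
```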
Explanation
Causal attribution and evidence organization
Identifies contributing events from indicator baseline deviations, ranks them by impact magnitude, and organizes complete evidence chains. For example, 'This user's fasting blood glucose decreased from baseline — identify contributing events, rank by contribution, and organize supporting evidence'. This is the most challenging dimension, requiring mechanism recovery and evidence-ranking capabilities.
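Since each synthesized event carries explicit per-indicator impact parameters, the ranking step could be sketched as below (the `impact` field name is hypothetical; the real event-log schema is not shown on this page):

```python
def rank_contributing_events(events):
    """Rank candidate events by the absolute magnitude of their recorded
    per-indicator impact ('impact' is a hypothetical field name)."""
    return sorted(events, key=lambda e: abs(e["impact"]), reverse=True)
```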
Paper & Citation
ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agent
Evaluating longitudinal health agents requires benchmarks with multi-source trajectories and deterministic ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark that provides 20 synthetic users, each with a 1–5 year health trajectory comprising a structured profile, daily device measurements, periodic exam records, and an event log carrying explicit per-indicator impact parameters. Each user is paired with 100 evaluation queries organized along five dimensions — Lookup, Trend, Comparison, Anomaly, and Explanation — and stratified into Easy, Medium, and Hard tiers. Because every event–indicator relationship is recorded with explicit temporal parameters, all ground-truth answers are programmatically computable.
@article{eslbench2026,
  title={ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agent},
  author={Shanda Group},
  journal={arXiv preprint arXiv:2604.02834},
  year={2026}
}