Dataset
Event-driven synthetic longitudinal health evaluation dataset, monthly updates, hosted on HuggingFace, covering five evaluation dimensions
The test dataset is available for download on the last day of each month. The leaderboard submission portal is open from 6:00 PM to 10:00 PM (UTC-7) on that same day; please complete your upload within the submission window. The leaderboard refreshes on the 1st of the following month. The validation set is permanently open, and HMA will update it from time to time, so stay tuned!
Released on 2026-03-24
Version History
| Batch | Release Date | Users | Cases | Download |
|---|---|---|---|---|
| 202604 (latest) | 2026-03-24 | 45 | 50 | HuggingFace |
| 202603 | 2026-03-04 | 120 | 2,000 | HuggingFace |
Dataset Overview
The ESL-Bench dataset is built on an event-driven synthetic data generation framework that models health events as first-class temporal objects with explicit physiological response kernels, and generates longitudinal multi-modal health records through a three-stage pipeline (user initialization → event-driven daily simulation → structured export). ESL-Bench is designed specifically to evaluate the structured-retrieval, planning, and temporal-reasoning capabilities of longitudinal health agents.
Each virtual user covers 1-3 years of a complete health trajectory, including a personal profile, daily device indicator streams, sparse exam records, and structured event logs. Events drive indicator changes through explicit sigmoid-onset and exponential-decay temporal kernels. All evaluation answers are computed programmatically from the exported structured data, ensuring deterministic, verifiable ground truth and sidestepping medical data compliance challenges entirely (no real patient data is involved).
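The sigmoid-onset / exponential-decay response mechanism described above can be sketched in a few lines. This is a minimal illustration, not the actual generator code: the function names, parameter defaults, and the multiplicative composition of onset and decay are all assumptions.

```python
import math

def event_kernel(t_days, onset_midpoint=2.0, onset_steepness=1.5, decay_rate=0.1):
    """Illustrative response kernel: sigmoid ramp-up times exponential decay.

    t_days is the number of days elapsed since the event. The return value
    is a multiplier in [0, 1] that scales the event's effect on an indicator.
    """
    if t_days < 0:
        return 0.0  # event has not occurred yet
    onset = 1.0 / (1.0 + math.exp(-onset_steepness * (t_days - onset_midpoint)))
    decay = math.exp(-decay_rate * t_days)
    return onset * decay

def indicator_value(baseline, events, day):
    """Daily indicator value = baseline + sum of active event effects.

    events: list of (event_day, effect_magnitude) tuples (hypothetical shape).
    """
    return baseline + sum(mag * event_kernel(day - d) for d, mag in events)
```

Under this kind of kernel, an event's effect ramps up over a few days, peaks, and then fades, so the programmatic ground truth for any date is fully determined by the event log.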
The dataset is hosted on HuggingFace and is fully open source, with community oversight and correction.
Evaluation Dimensions
Each virtual user's 100 questions cover 5 evaluation dimensions (20 per dimension), from data lookup to causal explanation:
- **Direct data retrieval**: query user profile attributes, device indicator values on specific dates, exam results, and event properties. Includes adversarial indicator pairs (e.g., blood WBC vs. urine WBC, hs-CRP vs. CRP) to prevent shortcut answers.
- **Temporal trend analysis**: monthly aggregation, rate-of-change computation, consecutive-trend identification, volatility analysis, and regime-change detection. Tests the agent's pattern recognition on temporal data.
- **Cross-event or cross-source comparison**: pre/post-event indicator changes, shared-indicator overlap between events, and severity ranking. Text-retrieval methods begin to fall significantly behind at this dimension.
- **Anomaly detection and tracking**: threshold-exceedance checks, abnormal-streak counting, multi-indicator abnormal-cluster identification, and cross-exam deterioration-trend tracking. Tests the agent's ability to detect and track health anomalies.
- **Causal attribution and evidence organization**: identify contributing events from indicator baseline deviations, rank them by impact magnitude, and organize evidence chains. Scored with a hybrid of programmatic checks and a rubric.
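Because ground-truth answers are computed programmatically from the exported structured data, a trend-dimension question such as a monthly aggregation reduces to a deterministic computation. The record schema below (keys `date`, `indicator`, `value`) is an assumption for illustration; the actual timeline.json schema may differ.

```python
def monthly_mean(timeline_records, indicator, year_month):
    """Average one device indicator over one month ("YYYY-MM").

    timeline_records: list of dicts with (assumed) keys
    'date' ("YYYY-MM-DD"), 'indicator', and 'value'.
    """
    values = [
        r["value"]
        for r in timeline_records
        if r.get("indicator") == indicator and r["date"].startswith(year_month)
    ]
    return sum(values) / len(values) if values else None

# Hypothetical records for illustration:
records = [
    {"date": "2026-03-01", "indicator": "resting_hr", "value": 62},
    {"date": "2026-03-15", "indicator": "resting_hr", "value": 68},
    {"date": "2026-04-01", "indicator": "resting_hr", "value": 70},
]
# monthly_mean(records, "resting_hr", "2026-03") -> 65.0
```

The same pattern (filter, then aggregate) underlies the other trend-style computations, which is what makes the expected answers deterministic and verifiable.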
Update Mechanism
The dataset follows a monthly update cycle, releasing a new batch each month (e.g., 202603 is the March 2026 batch).
Monthly updates keep the evaluation data fresh and prevent overfitting to a fixed question set, while historical versions are preserved for longitudinal comparison.
Dataset File Structure
The dataset is hosted on HuggingFace and organized into two directory levels: batch, then user. Each batch is named `YYYYMM` (e.g., `202603`) and contains the latest evaluation data for that month. Only the latest batch is used for evaluation; historical batches are preserved for version tracking and longitudinal comparison.
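Under these conventions, the repository layout looks roughly like the sketch below (user directory names are illustrative; consult the actual repository for exact naming):

```
manifest.json                     # repository-level version index
202603/
├── user_0001/
│   ├── profile.json
│   ├── timeline.json
│   └── kg_evaluation_queries.json
└── user_0002/
    └── ...
202604/
└── ...
```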
Per-User File Details
- `profile.json` — User Profile (p_i): demographics, chronic conditions, lifestyle, and medication history in structured JSON
- `timeline.json` — Complete Health Timeline: a chronologically sorted, unified temporal view of all device indicators, exam data, and health events
- `kg_evaluation_queries.json` — Evaluation Questions: 100 questions across the five dimensions (Lookup / Trend / Comparison / Anomaly / Explanation), each with `expected_value`, `answer_type`, `key_points`, and `source_data` references
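Given the `expected_value` and `answer_type` fields carried by each question, a programmatic check for a single answer can be sketched as below. The specific `answer_type` values and the numeric tolerance are assumptions for illustration, not the platform's actual scoring code.

```python
def check_answer(query, agent_answer, tol=1e-6):
    """Score one query record against an agent's answer.

    query: dict with (assumed) keys 'expected_value' and 'answer_type'.
    Numeric answers are compared within a tolerance; everything else
    falls back to a normalized string match.
    """
    expected = query["expected_value"]
    if query["answer_type"] == "numeric":
        return abs(float(agent_answer) - float(expected)) <= tol
    return str(agent_answer).strip().lower() == str(expected).strip().lower()

q = {"expected_value": 65.0, "answer_type": "numeric"}  # hypothetical record
# check_answer(q, "65.0") -> True
```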
`manifest.json` is the repository's version index file, recording the creation time, user lists, and checksums for all batches. The HMA platform detects dataset updates automatically by reading this file.
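A client can mimic the platform's update detection by reading the manifest and picking the newest batch. In this sketch, the top-level `batches` key is an assumed schema detail; check the actual `manifest.json` for the real field names.

```python
def latest_batch(manifest):
    """Pick the newest batch name from a parsed manifest.json.

    Assumes a top-level "batches" mapping keyed by YYYYMM batch names,
    which sort chronologically as plain strings.
    """
    batches = manifest.get("batches", {})
    return max(batches) if batches else None

# Typical usage (path illustrative):
#   import json
#   with open("manifest.json") as f:
#       print(latest_batch(json.load(f)))
```

Because `YYYYMM` names sort lexicographically in chronological order, a plain `max` suffices with no date parsing.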
ESL-Bench Data Bounty
We're launching the ESL-Bench Data Bounty Program — an open call for the community to help improve ESL-Bench, the evaluation dataset of Health Memory Arena (HMA), the first comprehensive benchmark for assessing memory capabilities of health AI agents. This program runs from March 31 to April 30, 2026.
Valid bug reports earn a cash reward ($10–$50 USD / ¥100–¥500 RMB). Reward details will be confirmed via official email upon approval. Only the first submission of a given issue will be rewarded. Please search existing Discussions before submitting.
What Qualifies
- Q&A pair errors (wrong groundtruth, ambiguous questions)
- Data logic inconsistencies across subsets
- Annotation or labeling errors
- Benchmark design issues that affect evaluation validity
What Doesn't Qualify
- Subjective opinions without supporting evidence
- Duplicate reports of existing issues
- Incomplete submissions (missing required fields)
How to Submit
Post a new Discussion titled [Data Bounty] with the following info:
- Subset (e.g. user_events / qa_pairs)
- Row/ID (e.g. User_ID_045)
- Error Type (Causal / Value / Temporal / Missing / Systematic / Q&A)
- Description (what's wrong and why)
- Correction (suggested fix)
- Reference (PubMed / WHO / guideline, if applicable)
We'll respond within 5 business days. Valid submissions will be notified via Discussion and rewarded by official email.
Sample Submission
[Data Bounty] Missing / Empty timeline.json for 5 users — 23 queries unanswerable from source data
Subset: data/202604
Row/ID: user5027_AT_demo
Error Type: Q&A Pair Error
Description: The kg_evaluation_queries.json for the above users contains queries requiring device indicator data from timeline.json, but the corresponding files are either missing or empty (0 records), making 23 queries unverifiable.
Correction: Provide complete timeline.json files for all affected users
Reference: Data integrity issue — no medical literature required
For questions, leave a comment in HuggingFace Discussions or email support@healthmemoryarena.ai