Dataset

Event-driven synthetic longitudinal health evaluation dataset, monthly updates, hosted on HuggingFace, covering five evaluation dimensions

The test dataset is available for download on the last day of each month. The leaderboard submission portal is open from 6:00 PM to 10:00 PM (UTC-7) on the same day; please complete your upload within that window. The leaderboard refreshes on the 1st of the following month. The validation set is permanently open, and HMA will update it from time to time, so stay tuned!

202604 (Latest)

Released on 2026-03-24

View Dataset

Version History

Batch             Release Date   Users   Cases   Download
202604 (latest)   2026-03-24     45      50      HuggingFace
202603            2026-03-04     120     2,000   HuggingFace

Dataset Overview

The ESL-Bench dataset is built on an event-driven synthetic data generation framework that models health events as first-class temporal objects with explicit physiological response kernels, generating longitudinal multi-modal health records through a three-stage pipeline (user initialization → event-driven daily simulation → structured export). ESL-Bench is specifically designed to evaluate the structured retrieval, planning, and temporal reasoning capabilities of longitudinal health agents.

Each virtual user covers 1–3 years of a complete health trajectory, including a personal profile, daily device indicator streams, sparse exam records, and structured event logs. Events drive indicator changes through explicit sigmoid-onset and exponential-decay temporal kernels. All evaluation answers are programmatically computed from the exported structured data, ensuring deterministic, verifiable ground truth and sidestepping medical data compliance concerns entirely, since no real patient data is involved.
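The sigmoid-onset / exponential-decay kernel described above can be sketched as follows. This is a minimal illustration of the idea, not the actual ThetaGen implementation; the parameter names (onset_days, decay_rate, peak_effect) are assumptions for the sketch.

```python
import math

def event_kernel(t_days, onset_days=3.0, decay_rate=0.1, peak_effect=1.0):
    """Illustrative event-response kernel: a sigmoid ramp-up toward a
    peak effect, multiplied by an exponential decay back to baseline.

    t_days: days elapsed since the event occurred (t_days >= 0).
    """
    onset = 1.0 / (1.0 + math.exp(-(t_days - onset_days)))   # sigmoid onset
    decay = math.exp(-decay_rate * t_days)                   # exponential decay
    return peak_effect * onset * decay

# An indicator deviates from baseline, peaks a few days after the event,
# then gradually returns toward baseline.
deviations = [event_kernel(t) for t in range(30)]
```

Summing kernels from multiple overlapping events on top of a per-user baseline yields the kind of multi-event indicator streams the benchmark questions probe.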

The dataset is hosted on HuggingFace and fully open source, with community oversight and correction.

Evaluation Dimensions

Each virtual user's 100 questions cover 5 evaluation dimensions (20 per dimension), from data lookup to causal explanation:

LOOKUP (20%)

Direct data retrieval: query user profile attributes, device indicator values on specific dates, exam results, and event properties. Includes adversarial indicator pairs (e.g., blood WBC vs. urine WBC, hs-CRP vs. CRP) to prevent shortcuts.

  • What was the user's resting heart rate on 2024-03-15?
  • TSH value and reference range from an exam?
  • Which event affected the most indicators?
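A LOOKUP answer of this kind can be computed directly from the structured export. A minimal sketch, assuming a simplified record shape for timeline.json (the real ESL-Bench field names may differ):

```python
import json

# Hypothetical minimal timeline.json content, for illustration only.
timeline = json.loads("""
[
  {"date": "2024-03-14", "type": "device", "indicator": "resting_heart_rate", "value": 62},
  {"date": "2024-03-15", "type": "device", "indicator": "resting_heart_rate", "value": 58},
  {"date": "2024-03-15", "type": "device", "indicator": "step_count", "value": 10450}
]
""")

def lookup(records, indicator, date):
    """Return the value of `indicator` on `date`, or None if absent."""
    for r in records:
        if r["indicator"] == indicator and r["date"] == date:
            return r["value"]
    return None

answer = lookup(timeline, "resting_heart_rate", "2024-03-15")
```

Because the ground truth is computed by exactly this kind of deterministic query, lookup answers can be verified bit-for-bit against the export.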
TREND (20%)

Temporal trend analysis: monthly aggregation, rate-of-change computation, consecutive trend identification, volatility analysis, and regime change detection. Tests the agent's pattern recognition on temporal data.

  • In which month did resting heart rate show the largest month-over-month change?
  • Best/worst quarter means for an indicator?
  • On which date did an indicator exhibit a regime change?
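The first example question above reduces to monthly aggregation plus a month-over-month delta. A minimal sketch with a hypothetical daily (date, value) stream; the grouping-by-"YYYY-MM" approach is an assumption, not the benchmark's actual grading code:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily readings for one indicator.
daily = [("2024-01-05", 60), ("2024-01-20", 62),
         ("2024-02-03", 70), ("2024-02-18", 72),
         ("2024-03-10", 71)]

def monthly_means(records):
    """Group daily values by 'YYYY-MM' and average each month."""
    by_month = defaultdict(list)
    for date, value in records:
        by_month[date[:7]].append(value)
    return {m: mean(v) for m, v in sorted(by_month.items())}

def largest_mom_change(records):
    """Month with the largest absolute month-over-month change in mean."""
    means = monthly_means(records)
    months = list(means)
    deltas = {cur: abs(means[cur] - means[prev])
              for prev, cur in zip(months, months[1:])}
    return max(deltas, key=deltas.get)
```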
COMPARISON (20%)

Cross-event or cross-source comparison: pre/post event indicator changes, shared-indicator overlap between events, and severity ranking. Text retrieval methods begin to fall significantly behind starting at this dimension.

  • Step count change before vs. after the user started jogging?
  • Which events share affected indicators with a high-sodium diet?
  • Multi-event severity ranking comparison?
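A pre/post comparison like the first example splits the series at the event date and compares means on either side. A minimal sketch with hypothetical step counts and event date; ISO date strings compare correctly as plain strings:

```python
from statistics import mean

# Hypothetical daily step counts around an event (e.g. "started jogging").
steps = {"2024-05-01": 4000, "2024-05-02": 4200, "2024-05-03": 3900,
         "2024-05-04": 9500, "2024-05-05": 9800, "2024-05-06": 10100}
event_date = "2024-05-04"

def pre_post_change(series, event_date):
    """Mean value on/after the event date minus mean value before it."""
    pre = [v for d, v in series.items() if d < event_date]
    post = [v for d, v in series.items() if d >= event_date]
    return mean(post) - mean(pre)

delta = pre_post_change(steps, event_date)
```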
ANOMALY (20%)

Anomaly detection and tracking: threshold exceedance checks, abnormal streak counting, multi-indicator abnormal cluster identification, and cross-exam deterioration trend tracking. Tests the agent's ability to detect and track health anomalies.

  • Has fasting glucose ever been marked abnormal?
  • Which exams had both fasting glucose and HbA1c marked abnormal?
  • Multi-indicator abnormal cluster and causal chain analysis?
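Abnormal streak counting, one of the checks listed above, is a single pass over the series against a reference range. A minimal sketch; the glucose values and reference range here are illustrative, not taken from the dataset:

```python
def longest_abnormal_streak(values, low, high):
    """Length of the longest consecutive run of values outside [low, high]."""
    best = cur = 0
    for v in values:
        cur = cur + 1 if (v < low or v > high) else 0
        best = max(best, cur)
    return best

# Hypothetical fasting glucose readings (mmol/L), reference range 3.9-6.1.
glucose = [5.2, 6.5, 6.8, 7.0, 5.8, 6.3, 5.5]
streak = longest_abnormal_streak(glucose, 3.9, 6.1)
```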
EXPLANATION (20%)

Causal attribution and evidence organization: identify contributing events from indicator baseline deviations, rank them by impact magnitude, and organize evidence chains. Scored with a hybrid of programmatic checks and rubric-based grading.

  • Contributing events and ranking for a fasting blood glucose decrease?
  • Counterfactual: expected indicator change if an event were removed?
  • Evidence chain-based attribution explanation for indicator changes?

Update Mechanism

The dataset uses a monthly update mechanism, releasing a new batch each month (e.g., 202603 represents the March 2026 batch).

1
Data Generation: Generate new virtual user data via the ThetaGen event-driven synthesis engine; roughly 20 users per batch, each with 100 evaluation questions.
2
Publish to HuggingFace: Update manifest.json and upload the new batch's user data to the repository; old batches are preserved.
3
Platform Auto-sync: The HMA platform automatically detects manifest changes, syncs the latest version info, and re-evaluates leaderboard agents on the latest dataset.

Monthly updates keep the evaluation data fresh and prevent overfitting to fixed question sets, while historical versions are preserved for longitudinal comparison.
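The auto-sync step amounts to re-reading manifest.json and selecting the newest batch. A minimal sketch, assuming a hypothetical manifest shape (the real schema is defined by the HMA repository and may differ):

```python
import json

# Hypothetical manifest.json content, for illustration only.
manifest = json.loads("""
{
  "batches": {
    "202603": {"release_date": "2026-03-04", "users": ["user100_AT_demo"]},
    "202604": {"release_date": "2026-03-24", "users": ["user5027_AT_demo"]}
  }
}
""")

def latest_batch(manifest):
    """Batch IDs are YYYYMM strings, so the lexicographic max is the newest."""
    return max(manifest["batches"])

current = latest_batch(manifest)
```

A consumer can cache the last-seen batch ID and trigger re-evaluation only when this value changes.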

Dataset File Structure

The dataset is hosted on HuggingFace and organized into two-level batch/user directories. Each batch is named YYYYMM (e.g., 202603) and contains that month's evaluation data. Only the latest batch is used for evaluation; historical batches are preserved for version tracking and longitudinal comparison.

# HuggingFace repository root
manifest.json                      # Version index: release dates, user lists, checksums for all batches
data/
  202603/                          # Latest batch (March 2026 release, 20 virtual users)
    user100_AT_demo/               # Virtual user directory (directory name = user identifier)
      profile.json
      timeline.json
      kg_evaluation_queries.json
    user101_AT_demo/               # Same structure, 6 files per user
    ...
  202602/                          # Historical batch (preserved for version tracking, not used in current evaluation)
  ...
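Given this layout, resolving a user's files is simple path construction. A minimal sketch; the helper name and return shape are illustrative, not part of any official tooling:

```python
from pathlib import Path

def user_paths(root, batch, user_id):
    """Resolve the three per-user JSON files described below, following
    the layout data/<batch>/<user_id>/<name>.json shown above."""
    base = Path(root) / "data" / batch / user_id
    return {name: base / f"{name}.json"
            for name in ("profile", "timeline", "kg_evaluation_queries")}

paths = user_paths(".", "202603", "user100_AT_demo")
```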

Per-User File Details

profile.json

User Profile (Profile p_i): demographics, chronic conditions, lifestyle, medication history in structured JSON

timeline.json

Complete Health Timeline: chronologically sorted unified temporal view of all device indicators, exam data, and health events

kg_evaluation_queries.json

Evaluation Questions: 100 five-dimension questions (Lookup/Trend/Comparison/Anomaly/Explanation), each with expected_value, answer_type, key_points, and source_data references
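The listed fields (expected_value, answer_type, key_points, source_data) support deterministic grading for most dimensions. A minimal sketch of how a numeric question could be checked; the query contents and grading function here are illustrative assumptions, not the platform's actual scorer:

```python
import json

# One hypothetical entry from kg_evaluation_queries.json.
query = json.loads("""
{
  "question": "What was the user's resting heart rate on 2024-03-15?",
  "dimension": "lookup",
  "answer_type": "numeric",
  "expected_value": 58,
  "key_points": ["resting_heart_rate", "2024-03-15"],
  "source_data": "timeline.json"
}
""")

def grade(query, agent_answer, tol=1e-6):
    """Deterministic check for numeric answers; other answer_types
    (e.g. rubric-scored explanations) need richer logic."""
    if query["answer_type"] == "numeric":
        return abs(float(agent_answer) - query["expected_value"]) <= tol
    return str(agent_answer).strip() == str(query["expected_value"]).strip()

ok = grade(query, 58)
```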

manifest.json is the repository's version index file, recording creation time, user lists, and checksums for all batches. The HMA platform automatically detects dataset updates by reading this file.

ESL-Bench Data Bounty

We're launching the ESL-Bench Data Bounty Program — an open call for the community to help improve ESL-Bench, the evaluation dataset of Health Memory Arena (HMA), the first comprehensive benchmark for assessing memory capabilities of health AI agents. This program runs from March 31 to April 30, 2026.

$10–$50 / report

Valid bug reports earn a cash reward ($10–$50 USD / ¥100–¥500 RMB). Reward details will be confirmed via official email upon approval. Only the first submission of a given issue will be rewarded. Please search existing Discussions before submitting.

What Qualifies

  • Q&A pair errors (wrong ground truth, ambiguous questions)
  • Data logic inconsistencies across subsets
  • Annotation or labeling errors
  • Benchmark design issues that affect evaluation validity

What Doesn't Qualify

  • Subjective opinions without supporting evidence
  • Duplicate reports of existing issues
  • Incomplete submissions (missing required fields)

How to Submit

Post a new Discussion whose title begins with [Data Bounty], including the following info:

  • Subset (e.g. user_events / qa_pairs)
  • Row/ID (e.g. User_ID_045)
  • Error Type (Causal / Value / Temporal / Missing / Systematic / Q&A)
  • Description (what's wrong and why)
  • Correction (suggested fix)
  • Reference (PubMed / WHO / guideline, if applicable)

We'll respond within 5 business days. Valid submissions will be notified via Discussion and rewarded by official email.

Sample Submission

[Data Bounty] Missing / Empty timeline.json for 5 users — 23 queries unanswerable from source data

Subset: data/202604

Row/ID: user5027_AT_demo

Error Type: Q&A Pair Error

Description: The kg_evaluation_queries.json for the above users contains queries requiring device indicator data from timeline.json, but the corresponding files are either missing or empty (0 records), making 23 queries unverifiable.

Correction: Provide complete timeline.json files for all affected users

Reference: Data integrity issue — no medical literature required

Submit Report

The HMA team reserves the right to final interpretation of all report validity.

For questions, leave a comment in HuggingFace Discussions or email support@healthmemoryarena.ai