Dataset
Open evaluation datasets for healthcare AI agents, covering memory retrieval and hallucination detection
Synthetic Health Profiles
An event-driven mechanism generates longitudinal synthetic health profiles covering demographics, chronic-disease history, medication records, wearable-device metric streams, periodic exam results, and structured life-event logs. Each virtual user carries a complete health trajectory spanning months to years, simulating real-world long-term health changes, anomaly progression, and multi-factor interactions.
Profiles are generated through a programmatic three-stage pipeline (user initialization → event-driven simulation → structured export). Verifiable ground truth and event-dependency relations are constructed automatically, which supports standardized benchmark evaluation of healthcare AI agents on memory, reasoning, trend analysis, anomaly detection, and medical-safety tasks.
Demographics, chronic conditions, lifestyle, medication history
Daily device indicators, exam records, unified temporal view
Life events with temporal kernels and per-indicator effects
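To make "temporal kernels and per-indicator effects" concrete, here is a minimal sketch of how one life event could shift a daily indicator over time. The exponential-decay kernel, the `LifeEvent` fields, and the numbers are all assumptions chosen for illustration; the dataset's actual kernels and effect model are not specified here.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class LifeEvent:
    """Hypothetical event record; field names are illustrative only."""
    name: str
    start: date
    effects: dict[str, float]   # peak shift per indicator, e.g. {"resting_heart_rate": 8.0}
    half_life_days: float       # assumed exponential-decay kernel parameter


def kernel(event: LifeEvent, day: date) -> float:
    """Weight in [0, 1]: 0 before the event, 1 on its start day, halving every half-life."""
    elapsed = (day - event.start).days
    if elapsed < 0:
        return 0.0
    return 0.5 ** (elapsed / event.half_life_days)


def simulate(baseline: dict[str, float], events: list[LifeEvent],
             start: date, days: int) -> list[dict]:
    """Produce one row per day: baseline value plus the summed, kernel-weighted event effects."""
    rows = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        row = {"date": day.isoformat()}
        for indicator, base in baseline.items():
            shift = sum(e.effects.get(indicator, 0.0) * kernel(e, day) for e in events)
            row[indicator] = round(base + shift, 2)
        rows.append(row)
    return rows
```

In this sketch, an event contributes nothing before its start date, applies its full per-indicator effect on the start day, and fades afterwards, so overlapping events simply add up.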
The data is hosted on HuggingFace and is fully open-source, with community oversight.
Update Mechanism
Datasets follow a rolling release cadence; each benchmark ships new batches on its own schedule. Batch IDs use the YYYYMM format (e.g. 202603 = March 2026).
Rolling updates refresh evaluation data periodically to prevent overfitting on a fixed question set; historical versions remain available for longitudinal comparison.
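Because batch IDs follow the YYYYMM format, a client can validate them and pick the most recent one with a few lines of code. The sketch below is illustrative; the function names `parse_batch_id` and `latest_batch` are hypothetical and not part of any published tooling.

```python
from datetime import date


def parse_batch_id(batch_id: str) -> date:
    """Parse a YYYYMM batch ID into the first day of that month."""
    if len(batch_id) != 6 or not batch_id.isdigit():
        raise ValueError(f"not a YYYYMM batch ID: {batch_id!r}")
    year, month = int(batch_id[:4]), int(batch_id[4:])
    # date() itself raises ValueError for an impossible month such as 00 or 13.
    return date(year, month, 1)


def latest_batch(batch_ids: list[str]) -> str:
    """Return the chronologically latest batch ID from a non-empty list."""
    return max(batch_ids, key=parse_batch_id)
```

For example, `latest_batch(["202603", "202605", "202604"])` returns `"202605"`.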
ESL-Bench
An event-driven longitudinal health-memory benchmark that evaluates an agent's ability to memorize and reason over long-term user health data.
A longitudinal health-memory dataset. Event streams simulate real users' multi-year health changes, stressing an agent's ability to recall and reason over long-term, multi-source data.
50 synthetic users × 1–5 years of health trajectory → 200 evaluation cases. Each case includes baseline profile, event timeline (history), evaluation prompt, and ground truth.
User profiles are synthesized from a knowledge graph; health events are injected into the timeline following medical causal relations. Ground truth is programmatically derived from the graph — fully reproducible and auditable.
Five health-ability dimensions evaluated under a unified schema; answer types span numeric, boolean, list, and free-form explanation.
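As an illustration of what a trend-dimension question involves, consider finding the month in which an indicator shows the largest month-over-month change. The sketch below aggregates daily readings to monthly means and compares consecutive months. It is an assumption-laden example, not the benchmark's ground-truth derivation: the input shape, the use of the mean, and the choice of absolute (rather than signed or relative) change are all illustrative.

```python
from collections import defaultdict
from statistics import mean


def largest_monthly_change(readings: list[tuple[str, float]]) -> str:
    """Return the YYYY-MM month whose mean differs most from the previous month's mean.

    readings: (ISO date "YYYY-MM-DD", value) pairs in any order.
    Months are compared in sorted order; a gap in the data is not treated specially.
    """
    by_month: dict[str, list[float]] = defaultdict(list)
    for day, value in readings:
        by_month[day[:7]].append(value)   # month-level aggregation key, e.g. "2024-07"

    months = sorted(by_month)
    if len(months) < 2:
        raise ValueError("need at least two months of data")

    means = [mean(by_month[m]) for m in months]
    # Absolute change of each month relative to the month before it.
    changes = {months[i]: abs(means[i] - means[i - 1]) for i in range(1, len(months))}
    return max(changes, key=changes.get)
```

With made-up readings whose monthly means are 61, 62, 71, and 69 for May through August 2024, the function returns `"2024-07"`, since the June-to-July jump of 9 is the largest.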
Version List
| Batch | Release Date | Users | Cases | Download |
|---|---|---|---|---|
| 202605 (latest) | 2026-04-28 | 50 | 200 | HuggingFace |
| 202604 | 2026-03-24 | 45 | — | HuggingFace |
| 202603 | 2026-03-04 | 120 | — | HuggingFace |
Data Sample
One representative test case from the dataset.
{
  "id": "kg_trend_042",
  "title": "ESL-Bench Trend Dimension Sample",
  "user": {
    "type": "manual",
    "strict_inputs": [
      "In which month did resting heart rate show the largest month-over-month change?"
    ]
  },
  "eval": {
    "evaluator": "semantic",
    "dimension": "trend",
    "expected_value": "2024-07",
    "answer_type": "date",
    "key_points": [
      "Month-level aggregation required",
      "Rate of change computation"
    ],
    "source_data": "timeline.json (resting_heart_rate)"
  }
}
Dataset File Structure
The dataset is hosted on HuggingFace and organized in a two-level batch/user directory structure. Each batch directory is named YYYYMM/ (e.g., 202603/) and represents the latest evaluation data for that month. Only the latest batch is used for evaluation; historical batches are preserved for version tracking and longitudinal comparison.
Per-User File Details
profile.json: User Profile (Profile p_i). Demographics, chronic conditions, lifestyle, and medication history in structured JSON.
timeline.json: Complete Health Timeline. Chronologically sorted unified temporal view of all device indicators, exam data, and health events.
kg_evaluation_queries.json: Evaluation Questions. Five-dimension questions (Lookup/Trend/Comparison/Anomaly/Explanation), each with expected_value, answer_type, key_points, and source_data references.
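The three per-user files can be read from a local copy of the dataset with a small loader. The sketch below assumes a `<root>/<batch>/<user>/` layout matching the batch/user directory description, and assumes that kg_evaluation_queries.json holds a JSON list of cases shaped like the data sample (with the dimension under `eval.dimension`). The user directory name and both function names are hypothetical.

```python
import json
from pathlib import Path

USER_FILES = ("profile.json", "timeline.json", "kg_evaluation_queries.json")


def load_user(root: Path, batch_id: str, user_id: str) -> dict:
    """Load the three per-user JSON files, keyed by file name."""
    user_dir = root / batch_id / user_id
    data = {}
    for name in USER_FILES:
        with open(user_dir / name, encoding="utf-8") as fh:
            data[name] = json.load(fh)
    return data


def queries_for_dimension(queries: list[dict], dimension: str) -> list[dict]:
    """Keep only the cases whose eval.dimension matches (assumed case layout)."""
    return [q for q in queries if q.get("eval", {}).get("dimension") == dimension]
```

Keeping the loader separate from the filter means the filter can be reused on cases obtained some other way, for example streamed directly from the hosting service.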
manifest.json is the repository's version index file, recording creation time, user lists, and checksums for all batches. The HMA platform automatically detects dataset updates by reading this file.
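Checksums recorded in a manifest can be used to confirm that a downloaded batch is intact. The sketch below is hedged on two points the description does not specify: it assumes the checksums are SHA-256 hex digests, and it assumes they are available as a mapping from relative file path to digest. The real manifest.json layout may differ, so the mapping would need to be extracted from it accordingly.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(root: Path, checksums: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose digest differs from the expected one.

    checksums: assumed mapping of relative path -> expected SHA-256 hex digest.
    """
    bad = []
    for rel_path, expected in checksums.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            bad.append(rel_path)
    return bad
```

An empty return value means every listed file is present and matches its expected digest.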