Dataset
Event-driven synthetic longitudinal health evaluation dataset, monthly updates, hosted on HuggingFace, covering five evaluation dimensions
The test dataset is available for download on the last day of each month. The leaderboard submission portal is open from 6:00 PM to 10:00 PM (UTC-7) on that same day; please complete your upload within the submission window. The leaderboard refreshes on the 1st of the following month. The validation set is permanently open, and HMA will update it from time to time, so stay tuned!
Released on 2026-03-24
Version History
| Batch | Release Date | Users | Cases | Download |
|---|---|---|---|---|
| 202604 (latest) | 2026-03-24 | 45 | 50 | HuggingFace |
| 202603 | 2026-03-04 | 120 | 2,000 | HuggingFace |
Dataset Overview
The ESL-Bench dataset is built on an event-driven synthetic data generation framework that models health events as first-class temporal objects with explicit physiological response kernels, and generates longitudinal multi-modal health records through a three-stage pipeline (user initialization → event-driven daily simulation → structured export). ESL-Bench is designed specifically to evaluate the structured-retrieval, planning, and temporal-reasoning capabilities of longitudinal health agents.
Each virtual user covers 1-3 years of a complete health trajectory, including a personal profile, daily device indicator streams, sparse exam records, and structured event logs. Events drive indicator changes through explicit sigmoid-onset and exponential-decay temporal kernels. All evaluation answers are computed programmatically from the exported structured data, ensuring deterministic, verifiable ground truth and sidestepping medical data compliance challenges entirely (no real patient data is involved).
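The sigmoid-onset / exponential-decay response mechanism described above can be sketched in a few lines. This is a minimal illustration, not the actual generator code: the function names, parameter defaults, and the multiplicative composition of onset and decay are all assumptions.

```python
import math

def event_kernel(t_days, onset_midpoint=2.0, onset_steepness=1.5, decay_rate=0.1):
    """Illustrative response kernel: sigmoid ramp-up times exponential decay.

    t_days is the number of days elapsed since the event. The return value
    is a multiplier in [0, 1] that scales the event's effect on an indicator.
    """
    if t_days < 0:
        return 0.0  # event has not occurred yet
    onset = 1.0 / (1.0 + math.exp(-onset_steepness * (t_days - onset_midpoint)))
    decay = math.exp(-decay_rate * t_days)
    return onset * decay

def indicator_value(baseline, events, day):
    """Daily indicator value = baseline + sum of active event effects.

    events: list of (event_day, effect_magnitude) tuples (hypothetical shape).
    """
    return baseline + sum(mag * event_kernel(day - d) for d, mag in events)
```

Under this kind of kernel, an event's effect ramps up over a few days, peaks, and then fades, so the programmatic ground truth for any date is fully determined by the event log.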
The dataset is hosted on HuggingFace and is fully open source, with community oversight and correction.
Evaluation Dimensions
Each virtual user's 100 questions cover 5 evaluation dimensions (20 per dimension), from data lookup to causal explanation:
- **Direct data retrieval**: query user profile attributes, device indicator values on specific dates, exam results, and event properties. Includes adversarial indicator pairs (e.g., blood WBC vs. urine WBC, hs-CRP vs. CRP) to prevent shortcut answers.
- **Temporal trend analysis**: monthly aggregation, rate-of-change computation, consecutive-trend identification, volatility analysis, and regime-change detection. Tests the agent's pattern recognition on temporal data.
- **Cross-event or cross-source comparison**: pre/post-event indicator changes, shared-indicator overlap between events, and severity ranking. Text-retrieval methods begin to fall significantly behind at this dimension.
- **Anomaly detection and tracking**: threshold-exceedance checks, abnormal-streak counting, multi-indicator abnormal-cluster identification, and cross-exam deterioration-trend tracking. Tests the agent's ability to detect and track health anomalies.
- **Causal attribution and evidence organization**: identify contributing events from indicator baseline deviations, rank them by impact magnitude, and organize evidence chains. Scored with a hybrid of programmatic checks and a rubric.
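Because ground-truth answers are computed programmatically from the exported structured data, a trend-dimension question such as a monthly aggregation reduces to a deterministic computation. The record schema below (keys `date`, `indicator`, `value`) is an assumption for illustration; the actual timeline.json schema may differ.

```python
def monthly_mean(timeline_records, indicator, year_month):
    """Average one device indicator over one month ("YYYY-MM").

    timeline_records: list of dicts with (assumed) keys
    'date' ("YYYY-MM-DD"), 'indicator', and 'value'.
    """
    values = [
        r["value"]
        for r in timeline_records
        if r.get("indicator") == indicator and r["date"].startswith(year_month)
    ]
    return sum(values) / len(values) if values else None

# Hypothetical records for illustration:
records = [
    {"date": "2026-03-01", "indicator": "resting_hr", "value": 62},
    {"date": "2026-03-15", "indicator": "resting_hr", "value": 68},
    {"date": "2026-04-01", "indicator": "resting_hr", "value": 70},
]
# monthly_mean(records, "resting_hr", "2026-03") -> 65.0
```

The same pattern (filter, then aggregate) underlies the other trend-style computations, which is what makes the expected answers deterministic and verifiable.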
Update Mechanism
The dataset follows a monthly update cycle, releasing a new batch each month (e.g., 202603 is the March 2026 batch).
Monthly updates keep the evaluation data fresh and prevent overfitting to a fixed question set, while historical versions are preserved for longitudinal comparison.
Dataset File Structure
The dataset is hosted on HuggingFace and organized into two directory levels: batch, then user. Each batch is named `YYYYMM` (e.g., `202603`) and contains the latest evaluation data for that month. Only the latest batch is used for evaluation; historical batches are preserved for version tracking and longitudinal comparison.
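Under these conventions, the repository layout looks roughly like the sketch below (user directory names are illustrative; consult the actual repository for exact naming):

```
manifest.json                     # repository-level version index
202603/
├── user_0001/
│   ├── profile.json
│   ├── timeline.json
│   └── kg_evaluation_queries.json
└── user_0002/
    └── ...
202604/
└── ...
```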
Per-User File Details
- `profile.json` — User Profile (p_i): demographics, chronic conditions, lifestyle, and medication history in structured JSON
- `timeline.json` — Complete Health Timeline: a chronologically sorted, unified temporal view of all device indicators, exam data, and health events
- `kg_evaluation_queries.json` — Evaluation Questions: 100 questions across the five dimensions (Lookup / Trend / Comparison / Anomaly / Explanation), each with `expected_value`, `answer_type`, `key_points`, and `source_data` references
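Given the `expected_value` and `answer_type` fields carried by each question, a programmatic check for a single answer can be sketched as below. The specific `answer_type` values and the numeric tolerance are assumptions for illustration, not the platform's actual scoring code.

```python
def check_answer(query, agent_answer, tol=1e-6):
    """Score one query record against an agent's answer.

    query: dict with (assumed) keys 'expected_value' and 'answer_type'.
    Numeric answers are compared within a tolerance; everything else
    falls back to a normalized string match.
    """
    expected = query["expected_value"]
    if query["answer_type"] == "numeric":
        return abs(float(agent_answer) - float(expected)) <= tol
    return str(agent_answer).strip().lower() == str(expected).strip().lower()

q = {"expected_value": 65.0, "answer_type": "numeric"}  # hypothetical record
# check_answer(q, "65.0") -> True
```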
`manifest.json` is the repository's version index file, recording the creation time, user lists, and checksums for all batches. The HMA platform detects dataset updates automatically by reading this file.
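A client can mimic the platform's update detection by reading the manifest and picking the newest batch. In this sketch, the top-level `batches` key is an assumed schema detail; check the actual `manifest.json` for the real field names.

```python
def latest_batch(manifest):
    """Pick the newest batch name from a parsed manifest.json.

    Assumes a top-level "batches" mapping keyed by YYYYMM batch names,
    which sort chronologically as plain strings.
    """
    batches = manifest.get("batches", {})
    return max(batches) if batches else None

# Typical usage (path illustrative):
#   import json
#   with open("manifest.json") as f:
#       print(latest_batch(json.load(f)))
```

Because `YYYYMM` names sort lexicographically in chronological order, a plain `max` suffices with no date parsing.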
ESL-Bench Data Bounty
We're launching the ESL-Bench Data Bounty Program — an open call for the community to help improve ESL-Bench, the evaluation dataset of Health Memory Arena (HMA), the first comprehensive benchmark for assessing memory capabilities of health AI agents. This program runs from March 31 to April 30, 2026.
Valid bug reports earn a cash reward ($10–$50 USD / ¥100–¥500 RMB). Reward details will be confirmed via official email upon approval. Only the first submission of a given issue will be rewarded. Please search existing Discussions before submitting.
What Qualifies
- Q&A pair errors (wrong groundtruth, ambiguous questions)
- Data logic inconsistencies across subsets
- Annotation or labeling errors
- Benchmark design issues that affect evaluation validity
What Doesn't Qualify
- Subjective opinions without supporting evidence
- Duplicate reports of existing issues
- Incomplete submissions (missing required fields)
How to Submit
Post a new Discussion titled [Data Bounty] with the following info:
- Subset (e.g. user_events / qa_pairs)
- Row/ID (e.g. User_ID_045)
- Error Type (Causal / Value / Temporal / Missing / Systematic / Q&A)
- Description (what's wrong and why)
- Correction (suggested fix)
- Reference (PubMed / WHO / guideline, if applicable)
We'll respond within 5 business days. Valid submissions will be notified via Discussion and rewarded by official email.
Sample Submission
[Data Bounty] Missing / Empty timeline.json for 5 users — 23 queries unanswerable from source data
Subset: data/202604
Row/ID: user5027_AT_demo
Error Type: Q&A Pair Error
Description: The kg_evaluation_queries.json for the above users contains queries requiring device indicator data from timeline.json, but the corresponding files are either missing or empty (0 records), making 23 queries unverifiable.
Correction: Provide complete timeline.json files for all affected users
Reference: Data integrity issue — no medical literature required
For questions, leave a comment in HuggingFace Discussions or email support@healthmemoryarena.ai