Methodology

Benchmark design, task definitions, and scoring protocols

The HMA Evaluation System

Health Memory Arena (HMA) is a reproducible and extensible evaluation platform for healthcare AI agents. Built on the HolyEval framework, it runs a three-stage pipeline — virtual user → target system → automatic evaluation — to quantify agent capabilities systematically across multiple dimensions.

Covered Capabilities

Retrieval & Memory Reasoning

Powered by ESL-Bench: evaluates structured retrieval, temporal reasoning and event-based causal explanation over long-term user health data, across five dimensions — Lookup / Trend / Comparison / Anomaly / Explanation.

Medical Hallucination Detection

Powered by MedHall-Bench: detects plausible-sounding but factually wrong, fabricated, or unverifiable outputs in medical AI responses. Covers classic and data-oriented hallucination categories with multiple sub-types under each.

Safety Compliance & Red-Team Defense

Powered by MedHarm-Bench: probes whether a health AI holds its safety boundaries under inducement — refusing definitive diagnoses, prescriptions, contradicting physicians, or advice that delays urgent care. Covers 5 red-team dimensions, scored by a server-side LLM-as-Judge.

Evaluation Principles

Reproducible: Ground truth is programmatically generated or expert-verified; scoring protocols are deterministic.
Standardized: One unified BenchItem JSON format and a single Web UI submission interface across benchmarks.
Longitudinal: Rolling dataset releases with historical versions preserved, enabling cross-cycle agent capability tracking.

Scoring System

Total Score = Σ(L_i × W_i)

W = [0.20, 0.20, 0.20, 0.20, 0.20]

Unified Two-Stage Scoring

All dimensions use a unified two-stage scoring protocol: programmatic checks first (JSON schema parsing + ground truth comparison with numerical tolerance), then LLM rubric evaluation for answer quality.

Weighted Aggregation

All dimensions have equal weights (0.20 each), Total = Σ(D_i × W_i), ensuring each evaluation dimension contributes equally to the final score.

Deterministic Ground Truth

All standard answers are programmatically computed from exported structured data (Profile, Device Stream, Exam Visits, Event Log), requiring no external clinical knowledge, ensuring verifiability.

Version Consistency

Evaluation results within the same dataset version are directly comparable. Major monthly updates automatically trigger re-evaluation of all agents.

Five Evaluation Dimensions

Lookup

Direct data retrieval

query user profile attributes, device indicator values on specific dates, exam results, and event properties. For example, 'What was this user's resting heart rate on 2024-03-15?'. Tests agent's basic retrieval capability on structured health data, including adversarial indicator pairs to prevent shortcuts.

Proportion

20%

Trend

Temporal trend analysis

monthly aggregation, rate of change computation, consecutive trend identification, and regime change detection. For example, 'In which month did resting heart rate show the largest month-over-month change?'. Requires agents to maintain full temporal context; retrieval-based methods perform poorly due to timeline fragmentation.

Proportion

20%

Comparison

Cross-event or cross-source comparison

compute pre/post event indicator deltas, compare shared indicators across events, and rank severity. For example, 'How did mean step count change after started jogging routine compared with 14 days before?'. Requires cross-event joins and precise temporal window alignment; text retrieval methods fall behind due to scattered evidence.

Proportion

20%

Anomaly

Anomaly detection and tracking

check whether indicators exceed thresholds, count abnormal consecutive days, and identify multi-indicator abnormal clusters. For example, 'Has fasting glucose ever been marked abnormal? Which exams had both fasting glucose and HbA1c marked abnormal?'. Requires multi-condition filtering and cross-exam comparison.

Proportion

20%

Explanation

Causal attribution and evidence organization

identify contributing events from indicator baseline deviations, rank by impact magnitude, and organize complete evidence chains. For example, 'This user's fasting blood glucose decreased from baseline — identify contributing events, rank by contribution, and organize supporting evidence'. This is the most challenging dimension, requiring mechanism recovery and evidence ranking capabilities.

Proportion

20%

Paper & Citation

ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agent

Evaluating longitudinal health agents requires benchmarks with multi-source trajectories and deterministic ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark that provides 20 synthetic users, each with a 1–5 year health trajectory comprising a structured profile, daily device measurements, periodic exam records, and an event log carrying explicit per-indicator impact parameters. Each user is paired with 100 evaluation queries organized along five dimensions — Lookup, Trend, Comparison, Anomaly, and Explanation — and stratified into Easy, Medium, and Hard tiers. Because every event–indicator relationship is recorded with explicit temporal parameters, all ground-truth answers are programmatically computable.

BibTeX Citationbibtex

@article{eslbench2026,
  title={ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agent},
  author={Shanda Group},
  journal={arXiv preprint arXiv:2604.02834},
  year={2026}
}

References

ESL-Bench evaluates an agent's memory and reasoning over long-term user health data. Test items are programmatically generated from a knowledge graph; ground truth is verifiable and traceable.

HMA / ESL-Bench internal technical report (pre-publication paper, see tmp/main.pdf in repo) — covers synthetic user health-trajectory generation, knowledge-graph-based ground-truth computation, and the Lookup / Trend / Comparison / Anomaly / Explanation five-dimension protocol
Theta Health long-term health data synthesis methodology (internal docs) — longitudinal modeling across device signals, routine check-ups, and chronic-disease management

Reproduction

The evaluation framework and datasets are open-source. The same baselines published on HMA can be reproduced locally with the following three steps:

1Clone the HolyEval repo and run uv sync to install dependencies; populate API keys (OPENAI_API_KEY / GOOGLE_API_KEY / etc.) in .env
2Run the CLI evaluation command — HolyEval invokes the corresponding evaluator plugin to compute scores
3Reports are written to benchmark/report/; compare row-by-row against the public baseline snapshot

Reproduction command (example)bash

python -m benchmark.basic_runner eslbench sample --target-model gpt-4.1 -p 5

Reports land at benchmark/report/eslbench/ with the naming pattern {dataset}_{target_label}_{timestamp}.json