Health Memory Arena
Health Memory Arena (HMA) is committed to building a comprehensive, reproducible evaluation platform for healthcare AI agents, systematically covering key capability dimensions including retrieval, hallucination mitigation, dialogue quality, safety, and regulatory compliance.
Benchmarks
Each benchmark targets a distinct capability dimension to systematically evaluate the real-world performance of healthcare AI agents.
ESL-Bench
An event-driven longitudinal health-memory benchmark that evaluates an agent's ability to memorize and reason over long-term user health data.
Evaluates a healthcare AI agent's memory and reasoning over long-term user health data, spanning the full pipeline from basic information extraction to complex root-cause localization.
A synthetic test set covering diverse users and scenarios, with verifiable answers for automated scoring.
Two-stage scoring routed by answer type: numeric / boolean / list answers are matched programmatically with tolerance, while free-form text is graded 0–1 by an LLM-as-Judge against key-point rubrics. All five dimensions share the same protocol with equal weight.
Per-dimension sub-scores and a weighted aggregate, enabling horizontal comparisons across models and agent architectures.
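The two-stage routing and equal-weight aggregation described for ESL-Bench could be sketched roughly as follows. This is a minimal illustration, not HMA's actual scorer: the function names, the 5% relative tolerance, the order-insensitive list comparison, and the `judge` callable (standing in for the LLM-as-Judge) are all assumptions made for the example.

```python
import math
from statistics import mean
from typing import Callable, Optional


def score_answer(predicted, expected,
                 judge: Optional[Callable[[str, str], float]] = None,
                 rel_tol: float = 0.05) -> float:
    """Route one answer to a programmatic check or to a judge (hypothetical API)."""
    # Check bool first: in Python, bool is a subclass of int.
    if isinstance(expected, bool):
        return 1.0 if isinstance(predicted, bool) and predicted == expected else 0.0
    if isinstance(expected, (int, float)):
        if isinstance(predicted, bool) or not isinstance(predicted, (int, float)):
            return 0.0
        # Assumed tolerance rule: relative closeness within rel_tol.
        return 1.0 if math.isclose(predicted, expected, rel_tol=rel_tol) else 0.0
    if isinstance(expected, list):
        if not isinstance(predicted, list):
            return 0.0
        # Assumed list rule: order-insensitive, case-insensitive match.
        def norm(items):
            return sorted(str(x).strip().lower() for x in items)
        return 1.0 if norm(predicted) == norm(expected) else 0.0
    # Free-form text: defer to the judge and clamp its grade to [0, 1].
    if judge is None:
        raise ValueError("free-form answers need a judge callable")
    return min(1.0, max(0.0, float(judge(str(predicted), str(expected)))))


def aggregate(scores_by_dimension: dict) -> tuple:
    """Return per-dimension sub-scores and their equal-weight mean."""
    sub_scores = {dim: mean(vals) for dim, vals in scores_by_dimension.items()}
    return sub_scores, mean(sub_scores.values())
```

Checking `bool` before the numeric branch matters because `True` would otherwise be compared as the number 1. The aggregate here simply averages the sub-scores, matching the statement that all dimensions carry equal weight.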
MedHall-Bench
A benchmark for detecting and suppressing hallucinations in healthcare scenarios.
Evaluates whether a medical AI generates plausible-sounding but wrong, fabricated, or untraceable information, covering classic and data hallucination categories with multiple sub-types under each.
Medical-AI evaluation items spanning multiple clinical specialties, each with structured ground truth or expert annotations.
Category-routed dual-channel evaluation: classic hallucinations use LLM-as-Judge (citation cases additionally verified via PubMed / CrossRef), while data hallucinations combine LLM extraction with programmatic checks — lethal-grade errors weighted ×2.
Per-type detection rates across all hallucination categories plus an overall hallucination score (0–1). Data hallucinations also emit per-field match details for cross-model comparison.
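As a rough illustration of the MedHall-Bench outputs, the sketch below computes per-field match details for a data-hallucination item and a weighted rate in which lethal-grade items count double. The record layout, the field names, and the exact formula are assumptions for the example; the platform's real scoring formula and the direction of its 0–1 score are not specified here.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    category: str          # "classic" or "data" (assumed labels)
    hallucinated: bool     # verdict from the judge or the programmatic check
    lethal: bool = False   # whether the item is lethal-grade


def field_matches(extracted: dict, truth: dict) -> dict:
    """Per-field comparison of extracted values against ground truth."""
    return {key: extracted.get(key) == value for key, value in truth.items()}


def weighted_hallucination_rate(results, lethal_weight: float = 2.0) -> float:
    """Share of weighted items flagged as hallucinated (0 = none, 1 = all)."""
    total = 0.0
    flagged = 0.0
    for r in results:
        weight = lethal_weight if r.lethal else 1.0
        total += weight
        if r.hallucinated:
            flagged += weight
    return flagged / total if total else 0.0
```

For example, one flagged lethal-grade item alongside two clean ordinary items gives a weighted rate of 2 / 4 = 0.5, rather than the unweighted 1 / 3.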
Four Steps to Evaluate
Open data, fair evaluation: every step brings you closer to the limits of agent capability
Get Dataset
Validation set fully open for research reproduction; test set uses time-limited access to ensure benchmark validity and fairness
Open download on HuggingFace
Agent Generates Answers
Use the standardized evaluation set as input to drive the target agent's reasoning, collecting its output for each question to build a complete answer file
Open building
Submit for Evaluation
Upload answer files to the HMA platform; the system automatically validates coverage and runs asynchronous scoring, producing per-benchmark capability reports
HMA automatic evaluation
Public Ranking or Private Use
Choose to publish scores for leaderboard ranking, or keep private for internal capability diagnosis and iterative optimization
Public / Private
Open evaluation, transparent standards: providing credible capability references for every healthcare AI agent.
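Before uploading in the Submit for Evaluation step, it may help to run a local coverage check similar to the one the platform performs. The sketch below assumes an answer file is a list of records with a `question_id` key; that layout and the function name are hypothetical, since the actual submission format is not described here.

```python
from collections import Counter


def check_coverage(expected_ids, answers):
    """Report question IDs that are missing, unexpected, or answered twice."""
    seen = Counter(record["question_id"] for record in answers)
    expected = set(expected_ids)
    answered = set(seen)
    return {
        "missing": sorted(expected - answered),
        "unexpected": sorted(answered - expected),
        "duplicated": sorted(qid for qid, n in seen.items() if n > 1),
    }
```

An answer file is complete under this check when all three lists come back empty.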