Health Memory Arena

Health Memory Arena (HMA) is committed to building a comprehensive and reproducible evaluation platform for healthcare AI agents. It systematically covers key capability dimensions, including retrieval, hallucination mitigation, dialogue quality, safety, and regulatory compliance.

Leaderboard

Dataset →

GitHub →

Benchmarks

Each benchmark targets a distinct capability dimension to systematically evaluate the real-world performance of healthcare AI agents.

ESL-Bench

An event-driven longitudinal health-memory benchmark that evaluates an agent's ability to memorize and reason over long-term user health data.

What it tests

Covers the full pipeline of memory and reasoning over long-term user health data, from basic information extraction to complex root-cause localization.

How it's measured

A synthetic test set covering diverse users and scenarios, with verifiable answers for automated scoring.

Evaluation pipeline

Two-stage scoring routed by answer type: numeric / boolean / list answers are matched programmatically with tolerance, while free-form text is graded 0–1 by an LLM-as-Judge against key-point rubrics. All five dimensions share the same protocol with equal weight.
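The routing described above can be sketched roughly as follows. This is a minimal illustration, not the HMA implementation: the function names, the tolerance values, the order-insensitive list comparison, and the `judge` callable (standing in for an LLM-as-Judge call that returns a grade) are all assumptions.

```python
import math
from typing import Callable


def score_item(pred, gold, judge: Callable[[str, str], float],
               rel_tol: float = 0.01, abs_tol: float = 1e-9) -> float:
    """Route one answer to a programmatic check or to the judge (hypothetical helper)."""
    # bool is a subclass of int in Python, so test it before the numeric branch.
    if isinstance(gold, bool):
        return 1.0 if isinstance(pred, bool) and pred == gold else 0.0
    if isinstance(gold, (int, float)):
        if isinstance(pred, bool) or not isinstance(pred, (int, float)):
            return 0.0
        # Tolerance values are assumed; the source does not state them.
        close = math.isclose(pred, gold, rel_tol=rel_tol, abs_tol=abs_tol)
        return 1.0 if close else 0.0
    if isinstance(gold, list):
        if not isinstance(pred, list):
            return 0.0
        # Order-insensitive comparison; an assumption about list matching.
        return 1.0 if sorted(map(str, pred)) == sorted(map(str, gold)) else 0.0
    # Free-form text: defer to the judge and clamp its grade to [0, 1].
    return min(1.0, max(0.0, float(judge(str(pred), str(gold)))))


def aggregate(scores_by_dimension: dict[str, list[float]]) -> float:
    """Equal-weight mean of the per-dimension mean scores."""
    means = [sum(v) / len(v) for v in scores_by_dimension.values() if v]
    return sum(means) / len(means) if means else 0.0
```

Averaging per dimension first, then across dimensions, keeps a dimension with many questions from dominating the aggregate, which is one way to read "equal weight".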

What you get

Per-dimension sub-scores and an equally weighted aggregate, enabling side-by-side comparisons across models and agent architectures.

MedHall-Bench

A benchmark for detecting and suppressing hallucinations in healthcare scenarios.

What it tests

Evaluates whether a medical AI generates plausible-sounding but wrong, fabricated, or untraceable information, covering classic and data hallucination categories with multiple sub-types under each.

How it's measured

Medical-AI evaluation items spanning multiple clinical specialties, each with structured ground truth or expert annotations.

Evaluation pipeline

Category-routed dual-channel evaluation: classic hallucinations use LLM-as-Judge (citation cases are additionally verified via PubMed / CrossRef), while data hallucinations combine LLM extraction with programmatic checks. Lethal-grade errors are weighted ×2.
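One way the ×2 weighting could enter a 0–1 score is as a weighted pass rate, sketched below. The source does not say how the weight is applied or which direction the score runs, so the `Check` structure, the `weighted_pass_rate` name, and the formula itself are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Check:
    passed: bool          # outcome of one programmatic or judged check
    lethal: bool = False  # True if a failure here would be lethal-grade


def weighted_pass_rate(checks: list[Check], lethal_weight: float = 2.0) -> float:
    """Share of total weight carried by passed checks; lethal checks count double."""
    weights = [lethal_weight if c.lethal else 1.0 for c in checks]
    total = sum(weights)
    if total == 0:
        return 0.0
    passed = sum(w for w, c in zip(weights, checks) if c.passed)
    return passed / total
```

With one passed ordinary check and one failed lethal-grade check, the result is 1/3 rather than 1/2, so a lethal-grade failure pulls the score down twice as hard as an ordinary one.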

What you get

Per-type detection rates across all hallucination categories plus an overall hallucination score (0–1). Data hallucinations also emit per-field match details for cross-model comparison.

Four Steps to Evaluate

Open data, fair evaluation: every step moves closer to the boundaries of agent capability.

Get Dataset

The validation set is fully open for research reproduction; the test set uses time-limited access to preserve benchmark validity and fairness.

Open download on HuggingFace

Agent Generates Answers

Feed the standardized evaluation set to the target agent and collect its output for each question to build a complete answer file.

Open building
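The answer-collection step might look like the sketch below. The record fields (`id`, `question`, `answer`) and the JSON Lines layout are assumptions; the actual answer-file schema expected by HMA is not given here, and `agent` stands in for whatever callable wraps the target agent.

```python
import json
from typing import Callable, Iterable


def collect_answers(questions: Iterable[dict],
                    agent: Callable[[str], str]) -> list[dict]:
    """Run the agent on every question and keep one record per question id."""
    return [{"id": q["id"], "answer": agent(q["question"])} for q in questions]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one answer per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Keeping collection separate from serialization makes it easy to swap the output format once the real schema is known.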

Submit for Evaluation

Upload answer files to the HMA platform. The system automatically validates coverage and runs asynchronous scoring, producing per-benchmark capability reports.

HMA automatic evaluation
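A coverage check of this kind can be run locally before uploading. The sketch below assumes answers are keyed by question id; the function name and the report fields are hypothetical and do not reflect the platform's actual validation output.

```python
from typing import Iterable


def validate_coverage(question_ids: Iterable[str],
                      answer_ids: Iterable[str]) -> dict:
    """Compare the ids in an answer file against the expected question ids."""
    expected = set(question_ids)
    provided = set(answer_ids)
    missing = sorted(expected - provided)      # questions with no answer
    unexpected = sorted(provided - expected)   # answers for unknown questions
    return {
        "complete": not missing,
        "missing": missing,
        "unexpected": unexpected,
    }
```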

Public Ranking or Private Use

Choose to publish scores for leaderboard ranking, or keep them private for internal capability diagnosis and iterative optimization.

Public / Private

Also supports fully local evaluation

The open-source evaluation framework can be deployed locally for dataset loading, answer comparison, and scoring; no upload to HMA is required.

View Framework →

Open evaluation, transparent standards: credible capability references for every healthcare AI agent.

Submit
