Health Memory Arena

An event-driven benchmark for longitudinal health AI agents.
Synthetic data, deterministic ground truth, five-dimension evaluation.

The test dataset is available for download on the last day of each month. The leaderboard submission portal opens from 6:00 PM to 10:00 PM (UTC-7) on the same day; please complete your upload within that window. The leaderboard refreshes on the 1st of the following month. The validation set is permanently open, and HMA will update it from time to time. Stay tuned!

Five Evaluation Dimensions

From data lookup to causal explanation, the five dimensions progressively expose the capability boundaries of different agent architectures.

Lookup (20%): direct data retrieval
queries over profile attributes, device values on specific dates, exam results, and event properties

Trend (20%): temporal trend analysis
monthly aggregation, rate of change, consecutive streaks, volatility, and regime detection

Comparison (20%): cross-event and cross-source comparison
pre/post event indicator changes, shared indicator overlap, and severity ranking

Anomaly (20%): anomaly detection
threshold exceedance, abnormal streaks, multi-indicator abnormal clusters, and cross-exam deterioration tracking

Explanation (20%): causal attribution and evidence organization
event contribution ranking, counterfactual estimation, dominant event identification, and multi-event net attribution
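Since all five dimensions carry an equal 20% weight, the overall score is simply the weighted average of the five per-dimension accuracies. A minimal sketch in Python; the dictionary keys and the accuracy figures are illustrative placeholders, not output of the official scorer:

```python
# Equal 20% weight per dimension, as stated on the leaderboard page.
DIMENSION_WEIGHTS = {
    "lookup": 0.20,
    "trend": 0.20,
    "comparison": 0.20,
    "anomaly": 0.20,
    "explanation": 0.20,
}

def overall_score(per_dimension_accuracy: dict[str, float]) -> float:
    """Weighted average of per-dimension accuracies (all weights are 0.20)."""
    return sum(
        DIMENSION_WEIGHTS[dim] * acc
        for dim, acc in per_dimension_accuracy.items()
    )

# Illustrative numbers only, not real benchmark results.
print(overall_score({
    "lookup": 0.91, "trend": 0.74, "comparison": 0.68,
    "anomaly": 0.63, "explanation": 0.41,
}))  # -> 0.674
```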

Four Steps to Evaluate

Open data, fair evaluation: each step probes closer to the capability boundaries of your agent.

01 Get Dataset
The validation set is fully open for research reproduction; the test set uses time-limited access to preserve benchmark validity and fairness.
Open download on HuggingFace

02 Agent Generates Answers
Use the standardized evaluation set as input to drive the target agent's reasoning, systematically collecting its output for each question into a complete answer file (see the sketch after this list).
Open building

03 Submit for Evaluation
Submit the answer file to the HMA platform, which automatically compares it against the ground truth and outputs a five-dimension score report.
HMA automatic evaluation

04 Public Ranking or Private Use
Choose to publish your scores for leaderboard ranking, or keep them private for internal capability diagnosis and iterative optimization.
Public / Private
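Steps 01 and 02 in practice: a minimal sketch, assuming the validation set is hosted on HuggingFace and answers are collected as JSON Lines. The repo id `hma/health-memory-arena`, the field names `id` and `question`, and the `run_agent` stand-in are hypothetical placeholders; check the HMA dataset page for the real path and schema.

```python
import json
from datasets import load_dataset  # pip install datasets

# Hypothetical repo id and field names; see the HMA HuggingFace page
# for the actual dataset path and schema.
questions = load_dataset("hma/health-memory-arena", split="validation")

def run_agent(question: str) -> str:
    """Stand-in for your agent; replace with your own pipeline."""
    raise NotImplementedError

# One JSON object per line: question id plus the agent's answer.
with open("answers.jsonl", "w", encoding="utf-8") as f:
    for q in questions:
        record = {"id": q["id"], "answer": run_agent(q["question"])}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```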
Also supports fully local evaluation
The open-source evaluation framework can be deployed locally for dataset loading, answer comparison, and scoring; no upload to HMA is required.
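The same compare-and-score loop can be reproduced locally. A minimal sketch, assuming each ground-truth record carries an `id`, a `dimension` label, and a reference `answer`, and that exact-match scoring suffices; the official framework's matching logic (e.g. numeric tolerance, multi-part answers) may well differ:

```python
import json
from collections import defaultdict

def load_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Assumed schemas: answers hold {"id", "answer"}; ground truth holds
# {"id", "dimension", "answer"}. Adjust to the real file formats.
answers = {r["id"]: r["answer"] for r in load_jsonl("answers.jsonl")}
truth = load_jsonl("ground_truth.jsonl")

correct, total = defaultdict(int), defaultdict(int)
for item in truth:
    dim = item["dimension"]
    total[dim] += 1
    # Exact string match as a placeholder for the real comparison logic.
    if answers.get(item["id"]) == item["answer"]:
        correct[dim] += 1

report = {dim: correct[dim] / total[dim] for dim in total}
print(report)                              # per-dimension accuracies
print(sum(report.values()) / len(report)) # equal-weight overall score
```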

Open evaluation, transparent standards: a credible capability reference for every healthcare AI agent.