Health Memory Arena
An event-driven benchmark for longitudinal health AI agents.
Synthetic data, deterministic ground truth, five-dimension evaluation.
The test dataset is available for download on the last day of each month. The leaderboard submission portal is open from 6:00 PM to 10:00 PM (UTC-7) on that same day; please complete your upload within the submission window. The leaderboard refreshes on the 1st of the following month. The validation set is permanently open, and HMA will update it from time to time; stay tuned!
Five Evaluation Dimensions
From data lookup to causal explanation, the five dimensions progressively expose the capability boundaries of different agent architectures.
Direct data retrieval (20%): query profile attributes, device values on specific dates, exam results, and event properties
Temporal trend analysis (20%): monthly aggregation, rate of change, consecutive streaks, volatility, and regime detection
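To illustrate the kind of computation this dimension targets, here is a minimal sketch of monthly aggregation and consecutive-streak detection; the data layout (a list of date/value pairs) is an assumption for illustration, not the HMA schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily readings for one indicator: (date, value) pairs.
readings = [
    (date(2024, 1, 5), 72), (date(2024, 1, 20), 75),
    (date(2024, 2, 3), 80), (date(2024, 2, 18), 84),
    (date(2024, 3, 10), 79),
]

def monthly_average(series):
    """Aggregate dated values into per-month means."""
    buckets = defaultdict(list)
    for d, v in series:
        buckets[(d.year, d.month)].append(v)
    return {k: sum(vs) / len(vs) for k, vs in sorted(buckets.items())}

def longest_rising_streak(series):
    """Length of the longest run of strictly increasing consecutive values."""
    best = cur = 1
    for (_, prev), (_, nxt) in zip(series, series[1:]):
        cur = cur + 1 if nxt > prev else 1
        best = max(best, cur)
    return best

print(monthly_average(readings))
# {(2024, 1): 73.5, (2024, 2): 82.0, (2024, 3): 79.0}
print(longest_rising_streak(readings))  # 4  (72 -> 75 -> 80 -> 84)
```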
Cross-event/cross-source comparison (20%): pre/post event indicator changes, shared-indicator overlap, and severity ranking
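A pre/post event comparison of the sort this dimension asks about can be sketched as follows; the date-to-value mapping and the 7-day window are hypothetical, and the real dataset's event and indicator formats may differ.

```python
from datetime import date, timedelta

# Hypothetical event date and indicator series (formats are illustrative only).
event_date = date(2024, 2, 1)
series = {  # date -> indicator value
    date(2024, 1, 25): 120, date(2024, 1, 30): 122,
    date(2024, 2, 3): 131, date(2024, 2, 6): 135,
}

def pre_post_change(series, event, window_days=7):
    """Difference between mean indicator values in the windows after vs. before an event."""
    pre = [v for d, v in series.items()
           if event - timedelta(days=window_days) <= d < event]
    post = [v for d, v in series.items()
            if event < d <= event + timedelta(days=window_days)]
    return sum(post) / len(post) - sum(pre) / len(pre)

print(pre_post_change(series, event_date))  # 133.0 - 121.0 = 12.0
```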
Anomaly detection (20%): threshold exceedance, abnormal streaks, multi-indicator abnormal clusters, and cross-exam deterioration tracking
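Threshold exceedance and abnormal-streak counting, two of the question types above, can be sketched as follows; the threshold and readings are made up for illustration rather than taken from the benchmark.

```python
# Hypothetical upper limit for an indicator; real HMA thresholds come from the dataset.
UPPER = 140

readings = [132, 145, 150, 148, 138, 152]

def exceedances(values, upper):
    """Indices of readings above the upper threshold."""
    return [i for i, v in enumerate(values) if v > upper]

def longest_abnormal_streak(values, upper):
    """Longest run of consecutive above-threshold readings."""
    best = cur = 0
    for v in values:
        cur = cur + 1 if v > upper else 0
        best = max(best, cur)
    return best

print(exceedances(readings, UPPER))              # [1, 2, 3, 5]
print(longest_abnormal_streak(readings, UPPER))  # 3
```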
Causal attribution and evidence organization (20%): event contribution ranking, counterfactual estimation, dominant event identification, and multi-event net attribution
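One simple way to illustrate counterfactual attribution is leave-one-out evaluation against a toy deterministic world model. The additive-effects model below is purely an assumption for the sketch; it is not necessarily how HMA's ground truth is generated.

```python
# Toy world model: each event contributes an additive effect to an indicator.
# Event names and effect sizes are invented for illustration.
events = {"late_night_shift": +8.0, "medication_start": -5.0, "travel": +2.0}

def simulate(active_events):
    """Deterministic outcome: total indicator change given the active events."""
    return sum(events[e] for e in active_events)

baseline = simulate(events)  # outcome with all events present

def counterfactual_contributions():
    """Leave-one-out: contribution of e = full outcome minus outcome without e."""
    return {e: baseline - simulate([x for x in events if x != e]) for e in events}

contrib = counterfactual_contributions()
dominant = max(contrib, key=lambda e: abs(contrib[e]))
print(contrib)    # {'late_night_shift': 8.0, 'medication_start': -5.0, 'travel': 2.0}
print(dominant)   # late_night_shift
```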
Four Steps to Evaluate
Open data, fair evaluation: every step probes the boundaries of agent capability.
Get Dataset
The validation set is fully open for research reproduction; the test set uses time-limited access to ensure benchmark validity and fairness
Open download on HuggingFace
Agent Generates Answers
Use the standardized evaluation set as input to drive the target agent's reasoning, systematically collecting the output for each question into a complete answer file
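This step might look like the sketch below. The question and answer schema here (field names `qid` and `answer`, file name `answers.json`) is a hypothetical placeholder, not the official HMA format; consult the benchmark documentation for the real schema.

```python
import json

# Hypothetical evaluation-set entries; field names are assumptions.
questions = [
    {"qid": "retrieval-001", "question": "What was the resting heart rate on 2024-03-02?"},
    {"qid": "trend-014", "question": "What was the average step count in February 2024?"},
]

def my_agent(question: str) -> str:
    """Stand-in for the agent under test; replace with your own system."""
    return "placeholder answer"

# Collect one answer per question into a complete answer file.
answers = [{"qid": q["qid"], "answer": my_agent(q["question"])} for q in questions]

with open("answers.json", "w") as f:
    json.dump(answers, f, indent=2)
```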
Open building
Submit for Evaluation
Submit the answer file to the HMA platform, which automatically compares it with ground truth and outputs a five-dimension score report
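A toy version of such a scorer, assuming exact-match grading and equal dimension weights (the published weights are 20% each); the answer and ground-truth formats, question IDs, and dimension labels are invented for the sketch.

```python
from collections import defaultdict

# Hypothetical ground truth, dimension labels, and submitted answers.
ground_truth = {"retrieval-001": "68", "trend-014": "7450", "anomaly-003": "3 days"}
dimension_of = {"retrieval-001": "retrieval", "trend-014": "trend", "anomaly-003": "anomaly"}
answers = {"retrieval-001": "68", "trend-014": "7300", "anomaly-003": "3 days"}

def score(answers, truth, dims):
    """Per-dimension exact-match accuracy, plus an equal-weight overall score."""
    per_dim = defaultdict(list)
    for qid, gold in truth.items():
        per_dim[dims[qid]].append(1.0 if answers.get(qid) == gold else 0.0)
    report = {d: sum(v) / len(v) for d, v in per_dim.items()}
    report["overall"] = sum(report.values()) / len(report)  # equal weights
    return report

print(score(answers, ground_truth, dimension_of))
```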
HMA automatic evaluation
Public Ranking or Private Use
Choose to publish your scores for leaderboard ranking, or keep them private for internal capability diagnosis and iterative optimization
Public / Private
Open evaluation, transparent standards: providing credible capability references for every healthcare AI agent.