Evaluation Framework

The evaluation framework is fully open source on GitHub, so developers can reproduce the evaluation results independently.

Quick Start

Setup (bash)

# Clone Repository
git clone https://github.com/healthmemoryarena/holyeval.git
cd holyeval

# Install Dependencies
uv sync

# Configure environment variables
cp .env.example .env  # then set OPENAI_API_KEY / GOOGLE_API_KEY in .env

ESL-Bench (bash)

# Run Benchmark — ESL-Bench (sample 50)
python -m benchmark.basic_runner eslbench sample50-20260331 \
  --target-type llm_api \
  --target-model gpt-4.1

Evaluation Details

Pick a dataset to see its evaluation dimensions, scoring protocol, and difficulty design.

Evaluation Dimensions

Cases cover five evaluation dimensions, ranging from data lookup to causal explanation:

LOOKUP (20%)

Direct data retrieval: query user profile attributes, device indicator values on specific dates, exam results, and event properties. Includes adversarial indicator pairs (e.g., blood WBC vs. urine WBC, hs-CRP vs. CRP) to prevent shortcuts.

What was the user's resting heart rate on 2024-03-15?
TSH value and reference range from an exam?
Which event affected the most indicators?
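To illustrate why the adversarial indicator pairs matter, here is a minimal sketch of an exact-match lookup. The record layout, the sample values, and the function name `lookup_value` are hypothetical and are not taken from the holyeval code base; the point is only that similarly named indicators must resolve to different records, and that a partial name should not match at all.

```python
from datetime import date
from typing import Optional

# Hypothetical records keyed by (exact indicator name, date).
# The values are made up for illustration and are not benchmark data.
RECORDS = {
    ("blood WBC", date(2024, 3, 15)): 6.2,
    ("urine WBC", date(2024, 3, 15)): 3.0,
    ("resting heart rate", date(2024, 3, 15)): 61.0,
}


def lookup_value(indicator: str, day: date) -> Optional[float]:
    """Return the value for an exact indicator name on a given date, or None."""
    return RECORDS.get((indicator, day))
```

An agent that matches on the substring "WBC" alone cannot tell the two indicators apart, which is the shortcut these pairs are meant to expose.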

TREND (20%)

Temporal trend analysis: monthly aggregation, rate-of-change computation, consecutive trend identification, volatility analysis, and regime change detection. Tests the agent's pattern recognition on temporal data.

In which month did resting heart rate show the largest month-over-month change?
Best/worst quarter means for an indicator?
At which date did an indicator exhibit a regime change?
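The first question above combines monthly aggregation with a rate-of-change computation. A minimal sketch, assuming readings arrive as (date, value) pairs and that "change" means the difference between consecutive monthly means; the function name and input shape are illustrative and may differ from how the benchmark computes its reference answers.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def largest_mom_change(readings: list[tuple[date, float]]) -> tuple[str, float]:
    """Return (month, change) for the largest absolute month-over-month change.

    The change for a month is its mean minus the mean of the previous month
    that has data (months with no readings are skipped, not treated as zero).
    """
    by_month: dict[str, list[float]] = defaultdict(list)
    for day, value in readings:
        by_month[day.strftime("%Y-%m")].append(value)

    months = sorted(by_month)  # "YYYY-MM" strings sort chronologically
    if len(months) < 2:
        raise ValueError("need at least two months of data")

    means = [mean(by_month[m]) for m in months]
    changes = [(months[i], means[i] - means[i - 1]) for i in range(1, len(months))]
    return max(changes, key=lambda item: abs(item[1]))
```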

COMPARISON (20%)

Cross-event or cross-source comparison: pre/post event indicator changes, shared indicator overlap between events, and severity ranking. Text-retrieval methods begin to fall significantly behind on this dimension.

Step count change before vs. after the user started jogging?
Which events share affected indicators with a high-sodium diet?
Multi-event severity ranking comparison?
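A pre/post event comparison can be sketched as the difference between window means on either side of the event date. The 14-day default window, the function name, and the input shape are assumptions made for this example; the benchmark's own window definition is not stated here.

```python
from datetime import date, timedelta
from statistics import mean


def pre_post_change(
    readings: list[tuple[date, float]],
    event_day: date,
    window_days: int = 14,  # assumed window length, for illustration only
) -> float:
    """Mean over the window starting at the event minus mean over the window before it."""
    window = timedelta(days=window_days)
    before = [v for d, v in readings if event_day - window <= d < event_day]
    after = [v for d, v in readings if event_day <= d < event_day + window]
    if not before or not after:
        raise ValueError("need readings on both sides of the event")
    return mean(after) - mean(before)
```

A positive result means the indicator rose after the event; a negative result means it fell.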

ANOMALY (20%)

Anomaly detection and tracking: threshold exceedance checks, abnormal streak counting, multi-indicator abnormal cluster identification, and cross-exam deterioration trend tracking. Tests the agent's ability to detect and track health anomalies.

Has fasting glucose ever been marked abnormal?
Which exams had both fasting glucose and HbA1c marked abnormal?
Multi-indicator abnormal cluster and causal chain analysis?
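Threshold exceedance and abnormal streak counting reduce to a single pass over the ordered values. A minimal sketch, assuming "abnormal" means outside a closed reference range [low, high]; the function name is hypothetical and the range is supplied by the caller rather than taken from the dataset.

```python
def longest_abnormal_streak(values: list[float], low: float, high: float) -> int:
    """Return the length of the longest run of consecutive out-of-range values."""
    longest = 0
    current = 0
    for value in values:
        if value < low or value > high:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # an in-range value breaks the streak
    return longest
```

A result greater than zero also answers the simpler question of whether the indicator was ever abnormal.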

EXPLANATION (20%)

Causal attribution and evidence organization: identify contributing events from indicator baseline deviations, rank them by impact magnitude, and organize evidence chains. Scoring is a hybrid of programmatic checks and rubric-based grading.

Contributing events and ranking for fasting blood glucose decrease?
Counterfactual: expected indicator change if an event were removed?
Evidence chain-based attribution explanation for indicator changes?
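One way to picture hybrid scoring is a weighted blend of a programmatic check and a rubric score. The sketch below is an assumption about the general shape only: the recall-style check, the equal default weighting, and the name `hybrid_score` are hypothetical and do not describe the framework's actual scoring protocol.

```python
def hybrid_score(
    expected_events: list[str],
    predicted_events: list[str],
    rubric_score: float,
    check_weight: float = 0.5,  # assumed weighting, for illustration only
) -> float:
    """Blend a programmatic recall check with a rubric score in [0, 1]."""
    if not expected_events:
        raise ValueError("expected_events must not be empty")
    if not 0.0 <= rubric_score <= 1.0:
        raise ValueError("rubric_score must be in [0, 1]")

    expected = set(expected_events)
    # Programmatic part: fraction of expected contributing events the answer named.
    programmatic = len(expected & set(predicted_events)) / len(expected)
    return check_weight * programmatic + (1.0 - check_weight) * rubric_score
```

In this sketch the rubric score would come from a separate grader that judges the quality of the evidence chain, while the programmatic part checks only whether the right events were identified.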