Evaluation Framework
The evaluation framework is fully open source on GitHub, so developers can reproduce the evaluation results independently.
Quick Start
# Clone Repository
git clone https://github.com/healthmemoryarena/holyeval.git
cd holyeval
# Install Dependencies
uv sync
# Configure environment variables
cp .env.example .env # OPENAI_API_KEY / GOOGLE_API_KEY
# Run Benchmark: ESL-Bench (sample 50)
python -m benchmark.basic_runner eslbench sample50-20260331 \
--target-type llm_api \
--target-model gpt-4.1
Evaluation Details
Pick a dataset to see its evaluation dimensions, scoring protocol, and difficulty design.
Evaluation Dimensions
Cases cover 5 evaluation dimensions, from data lookup to causal explanation:
Direct data retrieval: query user profile attributes, device indicator values on specific dates, exam results, and event properties. Includes adversarial indicator pairs (e.g., blood WBC vs. urine WBC, hs-CRP vs. CRP) to prevent shortcuts.
Temporal trend analysis: monthly aggregation, rate-of-change computation, consecutive trend identification, volatility analysis, and regime change detection. Tests the agent's pattern recognition on temporal data.
Cross-event or cross-source comparison: pre/post-event indicator changes, shared indicator overlap between events, and severity ranking. Text-retrieval methods begin to fall significantly behind in this dimension.
Anomaly detection and tracking: threshold exceedance checks, abnormal streak counting, multi-indicator abnormal cluster identification, and cross-exam deterioration trend tracking. Tests the agent's ability to detect and track health anomalies.
Causal attribution and evidence organization: identify contributing events from indicator baseline deviations, rank them by impact magnitude, and organize evidence chains. Scoring is hybrid, combining programmatic checks with a rubric.
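The trend and anomaly dimensions above name several concrete computations. As a rough illustration, the Python sketch below shows three of them: monthly aggregation, rate of change, and abnormal streak counting. The function names, the (date, value) layout, the sample readings, and the threshold are all hypothetical; none of them come from the holyeval codebase or the ESL-Bench data.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical sample: (date, value) readings for a single indicator.
READINGS = [
    (date(2025, 1, 5), 5.0),
    (date(2025, 1, 20), 7.0),
    (date(2025, 2, 3), 9.0),
    (date(2025, 2, 17), 11.0),
    (date(2025, 3, 2), 12.0),
    (date(2025, 3, 16), 6.0),
]


def monthly_means(readings):
    """Group readings by (year, month) and average each group."""
    buckets = defaultdict(list)
    for day, value in readings:
        buckets[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


def rate_of_change(series):
    """Relative change between consecutive values (0.5 means +50%).

    Assumes no value in the series is zero.
    """
    return [(curr - prev) / prev for prev, curr in zip(series, series[1:])]


def longest_abnormal_streak(values, upper_limit):
    """Length of the longest run of consecutive values above upper_limit."""
    longest = current = 0
    for value in values:
        current = current + 1 if value > upper_limit else 0
        longest = max(longest, current)
    return longest


means = monthly_means(READINGS)
changes = rate_of_change(list(means.values()))
# upper_limit=8.0 is an illustrative threshold, not a clinical reference range.
streak = longest_abnormal_streak([v for _, v in READINGS], upper_limit=8.0)
```

With these sample readings, the monthly means are 6.0, 10.0, and 9.0, and the longest run above the illustrative threshold is three readings. An agent answering a benchmark case has to arrive at figures like these from the raw records, which is likely part of why text retrieval alone struggles beyond direct lookup.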