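Evaluation Framework

The evaluation framework is fully open source and published on GitHub, so developers can reproduce the evaluation results themselves.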

Quick Start

Installation and Usage
# Clone the repository
git clone https://github.com/healthmemoryarena/hma-benchmark.git  # coming soon
cd hma-benchmark

# Install dependencies
uv sync

# Run the evaluation
python -m benchmark.basic_runner healthbench sample \
  --target-type llm_api \
  --target-model gpt-4o
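
To repeat the run above for several models, the command can be assembled from Python. This is a minimal sketch that uses only the arguments and flags shown above. The helper names `build_command` and `run_all` are hypothetical, and the meaning of the two positional arguments (`healthbench`, `sample`) is not documented here, so they are passed through unchanged.

```python
import subprocess
import sys


def build_command(model: str, positional: tuple = ("healthbench", "sample")) -> list:
    """Assemble the benchmark command shown above as an argument list.

    `positional` carries the two positional arguments from the example
    verbatim; their exact meaning is an assumption left to the runner.
    """
    return [
        sys.executable, "-m", "benchmark.basic_runner", *positional,
        "--target-type", "llm_api",
        "--target-model", model,
    ]


def run_all(models: list) -> None:
    """Run the benchmark once per model, stopping at the first failure."""
    for model in models:
        subprocess.run(build_command(model), check=True)


# Hypothetical usage, from the repository root after `uv sync`:
# run_all(["gpt-4o"])
```

Passing an argument list to `subprocess.run` avoids shell quoting issues, and `check=True` raises if a run exits with a non-zero status.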

API Example

Python
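import asyncio

from evaluator.core.orchestrator import do_single_test
from evaluator.core.schema import TestCase


async def evaluate(test_case: TestCase) -> None:
    # Run the evaluation for a single TestCase and report the outcome
    result = await do_single_test(test_case)
    print(f"Score: {result.score}, Pass: {result.passed}")

# Build a TestCase first, then run: asyncio.run(evaluate(test_case))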
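
Because `do_single_test` is awaited, several test cases could in principle be evaluated concurrently. The sketch below illustrates that pattern with a stand-in coroutine so it runs without the framework installed. `FakeResult`, `fake_single_test`, and `run_batch` are hypothetical names; only the `score` and `passed` attributes come from the example above. Whether the real function is safe to call concurrently is an assumption to verify against the repository.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Stand-in for the framework's result object (hypothetical)."""
    score: float
    passed: bool


async def fake_single_test(case_id: int) -> FakeResult:
    """Stand-in for do_single_test; derives a score from the case id."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    score = case_id / 10
    return FakeResult(score=score, passed=score >= 0.5)


async def run_batch(case_ids, limit: int = 4):
    """Evaluate all cases concurrently, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(case_id):
        async with semaphore:
            return await fake_single_test(case_id)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(c) for c in case_ids))


results = asyncio.run(run_batch([2, 5, 9]))
pass_rate = sum(r.passed for r in results) / len(results)
print(f"Pass rate: {pass_rate:.2f}")
```

The semaphore caps the number of simultaneous evaluations, which matters when the target is a rate-limited LLM API.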