Evaluation Definition

Learn about ESL-Bench's five-dimension evaluation framework, scoring protocol, and agent integration.

Scoring System

Total Score = Σ(D_i × W_i)

W = [0.20, 0.20, 0.20, 0.20, 0.20]

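Unified Two-Stage Scoring

All dimensions share one two-stage scoring protocol: an answer first goes through a programmatic check (JSON schema parsing plus comparison against the ground truth, with tolerance-based matching for numeric values), and an LLM rubric then evaluates answer quality.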
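A minimal sketch of the programmatic stage, assuming the agent returns a JSON object with a numeric `value` field and that a 1% relative tolerance is acceptable. The field name, the tolerance, and the function name `programmatic_check` are illustrative assumptions, not part of the published protocol. The LLM rubric stage is not shown.

```python
import json
import math


def programmatic_check(raw_answer: str, ground_truth: float, rel_tol: float = 0.01) -> bool:
    """Parse the agent's JSON answer and compare it with the ground truth.

    The `value` key and the 1% relative tolerance are assumptions for illustration.
    """
    try:
        payload = json.loads(raw_answer)
    except json.JSONDecodeError:
        return False  # unparseable output fails the check
    if not isinstance(payload, dict):
        return False
    value = payload.get("value")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isclose(value, ground_truth, rel_tol=rel_tol)
```

With these assumptions, `programmatic_check('{"value": 62.3}', 62.0)` passes, while a malformed string or a value of 70 against a ground truth of 62 fails.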

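Weighted Aggregation

All dimensions carry the same weight (0.20 each), and the total score is Σ(D_i × W_i), so every evaluation dimension contributes equally to the final score.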
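The aggregation can be sketched as follows. The per-dimension scores in the usage note are made-up numbers, and the score scale is an assumption; only the weights and the formula come from the text above.

```python
DIMENSIONS = ["Lookup", "Trend", "Comparison", "Anomaly", "Explanation"]
WEIGHTS = [0.20, 0.20, 0.20, 0.20, 0.20]


def total_score(dimension_scores: dict[str, float]) -> float:
    """Return the sum of D_i * W_i over the five dimensions."""
    return sum(dimension_scores[name] * w for name, w in zip(DIMENSIONS, WEIGHTS))
```

Because the weights are equal and sum to 1, the total is simply the mean of the five dimension scores: hypothetical scores of 90, 70, 60, 80, and 50 give a total of 70 (up to floating-point rounding).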

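Deterministic Ground Truth

All reference answers are computed programmatically from the exported structured data (Profile, Device Stream, Exam Visits, Event Log). No external clinical knowledge is required, which keeps every answer verifiable.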

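Version Consistency

Evaluation results obtained on the same dataset version are directly comparable. Each major monthly update automatically triggers a re-evaluation of all agents.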

Five-Dimension Evaluation Framework

Lookup

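Direct data retrieval

Queries cover user profile attributes, device indicator values on a given date, exam results, and event attributes. Example: "What was this user's resting heart rate on 2024-03-15?" This dimension tests an agent's basic ability to retrieve structured health data and includes adversarial indicator pairs to prevent shortcuts.

Weight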

20%

Trend

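Time-series trend analysis

Covers monthly aggregation of indicators, rate-of-change calculation, identification of sustained trends, and regime-change detection. Example: "In which month did resting heart rate show the largest month-over-month change?" The agent must keep the full temporal context; retrieval-based methods perform poorly because they fragment the timeline.

Weight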

20%
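For the Trend example above, one way to compute the month with the largest month-over-month change is sketched below. The input shape (a list of `(date, value)` pairs) and the function name are assumptions; the sketch also treats consecutive months present in the data as adjacent, which may not match how the benchmark handles missing months.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def largest_monthly_change(readings: list[tuple[date, float]]) -> tuple[str, float]:
    """Return (month, change) for the month whose mean differs most from the prior month's mean."""
    by_month: dict[str, list[float]] = defaultdict(list)
    for day, value in readings:
        by_month[day.strftime("%Y-%m")].append(value)  # monthly aggregation
    months = sorted(by_month)
    if len(months) < 2:
        raise ValueError("need at least two months of data")
    monthly_mean = {m: mean(by_month[m]) for m in months}
    # signed change of each month relative to the month before it
    changes = {
        cur: monthly_mean[cur] - monthly_mean[prev]
        for prev, cur in zip(months, months[1:])
    }
    best = max(changes, key=lambda m: abs(changes[m]))
    return best, changes[best]
```

The whole series is scanned in one pass, which reflects the point made above: the answer depends on every month, not on a few retrieved snippets.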

Comparison

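Comparative analysis across events or data sources

Covers computing indicator differences before and after an event, comparing indicators shared by multiple events, and ranking by severity. Example: "How did average step count change between the 14 days before the 'started jogging routine' event and the event period?" This requires cross-event joins and precise time-window alignment; text-retrieval methods fall behind because the evidence is scattered.

Weight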

20%
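The before-versus-during comparison in the example above can be sketched like this. The window conventions (the 14 days strictly before the start date, and an event period that includes both its start and end dates) are assumptions, as are the function name and the `dict` input shape.

```python
from datetime import date, timedelta
from statistics import mean


def before_vs_during(
    daily: dict[date, float], start: date, end: date, lookback_days: int = 14
) -> float:
    """Mean during [start, end] minus mean over the `lookback_days` days before `start`."""
    window_start = start - timedelta(days=lookback_days)
    before = [v for d, v in daily.items() if window_start <= d < start]
    during = [v for d, v in daily.items() if start <= d <= end]
    if not before or not during:
        raise ValueError("both windows need at least one reading")
    return mean(during) - mean(before)
```

Small differences in boundary handling (inclusive versus exclusive dates) change the result, which is why precise window alignment matters for this dimension.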

Anomaly

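Anomaly detection and tracking

Covers checking whether an indicator exceeds a threshold, counting consecutive abnormal days, and identifying clusters of multi-indicator anomalies. Example: "Was fasting blood glucose ever flagged as abnormal? In which exams were fasting blood glucose and HbA1c abnormal at the same time?" The agent must apply multi-condition filters and compare across exams.

Weight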

20%
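Counting consecutive abnormal days, one of the tasks listed above, might look like the sketch below. It assumes "abnormal" means strictly above a single upper threshold and that a missing calendar day breaks a streak; the benchmark's actual rules may differ.

```python
from datetime import date, timedelta


def longest_abnormal_streak(daily: dict[date, float], threshold: float) -> int:
    """Length of the longest run of consecutive calendar days with value > threshold."""
    longest = current = 0
    previous = None  # most recent abnormal day in the current run
    for day in sorted(daily):
        if daily[day] > threshold:
            # extend the run only if this day directly follows the previous abnormal day
            if previous is not None and day - previous == timedelta(days=1):
                current += 1
            else:
                current = 1
            previous = day
            longest = max(longest, current)
        else:
            previous = None
            current = 0
    return longest
```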

Explanation

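Causal attribution and evidence organization

Starting from an indicator's shift away from baseline, the agent identifies the contributing events, ranks them by impact magnitude, and assembles a complete evidence chain. Example: "This user's fasting blood glucose dropped from baseline; identify the contributing events, rank them by contribution, and organize the supporting evidence." This is the most challenging dimension and requires mechanism recovery and evidence ranking.

Weight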

20%
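The ranking step of the Explanation dimension can be illustrated with a deliberately simplified sketch. It assumes each event carries a single signed `impact` value for the indicator in question and treats an event as contributing when its impact has the same sign as the observed shift. The record schema, this contribution rule, and the function name are all hypothetical; the actual event log parameters are not described here.

```python
def rank_contributors(events: list[dict], shift: float) -> list[str]:
    """Names of events whose impact has the same sign as `shift`, largest magnitude first."""
    # keep only events that push the indicator in the direction of the observed shift
    same_direction = [e for e in events if e["impact"] * shift > 0]
    ranked = sorted(same_direction, key=lambda e: abs(e["impact"]), reverse=True)
    return [e["name"] for e in ranked]
```

For a downward shift, an event with impact -0.9 would rank ahead of one with impact -0.4, and an event with a positive impact would be excluded.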

Paper & Citation

ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agent

Evaluating longitudinal health agents requires benchmarks with multi-source trajectories and deterministic ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark that provides 20 synthetic users, each with a 1–5 year health trajectory comprising a structured profile, daily device measurements, periodic exam records, and an event log carrying explicit per-indicator impact parameters. Each user is paired with 100 evaluation queries organized along five dimensions — Lookup, Trend, Comparison, Anomaly, and Explanation — and stratified into Easy, Medium, and Hard tiers. Because every event–indicator relationship is recorded with explicit temporal parameters, all ground-truth answers are programmatically computable.

BibTeX citation

@article{eslbench2026,
  title={ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agent},
  author={Shanda Group},
  journal={arXiv preprint arXiv:2604.02834},
  year={2026}
}