World’s First AI Disease-Specific Evidence-Based Evaluation Framework GAPS Released

Published: December 30, 2025
Reading Time: 2 min read


Researchers from Ant Health and Peking University have jointly released GAPS, the world’s first evaluation framework dedicated to assessing AI's evidence-based clinical capabilities in specific diseases, starting with lung cancer (NSCLC).

Ant Health and Academician Wang Jun’s Team at Peking University Collaborate

Ant Health, together with the team led by Academician Wang Jun at Peking University People's Hospital, has unveiled GAPS (Grounding, Adequacy, Perturbation, Safety), the world's first evaluation framework dedicated to assessing large language models' evidence-based capabilities in specific diseases. Alongside the framework, the team also released the accompanying benchmark dataset GAPS-NSCLC-preview.

The initiative addresses a long-standing limitation in medical AI evaluation, which has relied heavily on "exam-style" Q&A and lacked systematic assessment of clinical depth, completeness, robustness, and safety.

The initial benchmark focuses on non-small cell lung cancer (NSCLC) and comprises 92 questions covering 1,691 clinical decision points, supported by a fully automated evaluation toolchain. By combining guideline-anchored question generation with multi-agent collaboration, the research team achieved end-to-end automation—from question creation and scoring rubric design to multi-dimensional evaluation. All related papers, datasets, and framework details have been publicly released.
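
For intuition, here is a minimal Python sketch of what such an end-to-end loop might look like, with each cooperating agent abstracted as a callable. Every name and signature below is a hypothetical illustration; the article does not describe the actual toolchain's interfaces.

```python
from typing import Callable

def run_benchmark(
    guideline_text: str,
    write_questions: Callable[[str], list[str]],  # agent: guideline -> questions
    design_rubric: Callable[[str], list[str]],    # agent: question -> decision points
    grade: Callable[[str, list[str]], dict],      # agent: (answer, rubric) -> per-dimension scores
    model_answer: Callable[[str], str],           # the model under evaluation
) -> list[dict]:
    """End-to-end loop: generate questions from guidelines, build a
    rubric per question, collect the model's answer, and grade it."""
    results = []
    for question in write_questions(guideline_text):
        rubric = design_rubric(question)
        answer = model_answer(question)
        results.append(grade(answer, rubric))
    return results
```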

GAPS decomposes clinical competence into four orthogonal dimensions:

- Grounding (G) – depth of clinical reasoning beyond factual recall
- Adequacy (A) – completeness of the response
- Perturbation (P) – robustness under uncertainty or conflicting evidence
- Safety (S) – adherence to non-negotiable clinical safety boundaries

Notably, the safety dimension introduces a strict "zero-tolerance" rule: any catastrophic or harmful clinical recommendation results in an automatic zero overall score.
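
A minimal sketch of how that gating rule might be implemented, assuming per-dimension scores in [0, 1] and an equal-weight average over the non-safety dimensions (the article does not specify the actual weighting):

```python
from dataclasses import dataclass

@dataclass
class GapsScores:
    grounding: float       # G: depth of clinical reasoning, in [0, 1]
    adequacy: float        # A: completeness of the response, in [0, 1]
    perturbation: float    # P: robustness under uncertainty, in [0, 1]
    safety_violated: bool  # S: True if any catastrophic or harmful recommendation was made

def overall_score(s: GapsScores) -> float:
    """Zero-tolerance safety gate: any violation zeroes the overall score.

    The equal-weight mean of G/A/P below is an illustrative assumption,
    not the paper's documented aggregation.
    """
    if s.safety_violated:
        return 0.0
    return (s.grounding + s.adequacy + s.perturbation) / 3
```

For example, `overall_score(GapsScores(0.9, 0.8, 0.7, safety_violated=True))` returns 0.0 regardless of how strong the other dimensions are.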

The project followed a clear division of labor: Wang Jun's clinical team defined the medical standards, while Ant Health handled engineering and system implementation, forming a "clinician-sets-standards, AI-enables-scale" collaboration model. The results have already been applied to "AQ" (Ant’s A-Fu).

Using GAPS, the team evaluated several leading models, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4. The results show that while these models perform well on factual recall, their performance drops sharply on higher-order tasks requiring uncertainty reasoning and clinical decision-making.

The release of GAPS marks a critical shift in medical AI evaluation—from optimizing for "test scores" to assessing true clinical competence.

Source: QbitAI