OpenAI HealthBench - Evaluating Your AI Doctor
HealthBench is a new benchmark designed to provide a more meaningful, trustworthy, and unsaturated evaluation of AI systems, particularly large language models, in realistic health scenarios. Developed in collaboration with 262 physicians, HealthBench comprises 5,000 diverse, multi-turn, multilingual health conversations that simulate interactions between an AI model and users or clinicians. A key technical aspect is its rubric-based evaluation: each model response is graded against physician-authored criteria that carry weighted points, and the grading itself is performed by a model-based grader (GPT-4.1). This approach aims to overcome the limitations of existing health evaluations and to provide a robust benchmark that drives continuous improvement in AI for health applications.

Key Points for a Technical Audience
- Large-Scale, Diverse Dataset: HealthBench provides a dataset of 5,000 realistic, multi-turn, and multilingual health conversations, capturing a wide range of medical specialties, contexts, and user/provider personas, offering a rich testbed for AI models.
- Fine-Grained Rubric Evaluation: The benchmark utilizes a detailed rubric-based evaluation system with 48,562 unique, physician-written criteria, allowing for a granular assessment of model responses against specific medical standards and priorities.
- Model-Based Grading: Evaluation of model responses against the extensive rubric criteria is performed by a model-based grader (GPT-4.1), providing an automated yet detailed assessment mechanism.
- Designed for Model Improvement: HealthBench is intentionally designed to be “unsaturated,” meaning current state-of-the-art models do not achieve perfect scores. This leaves significant headroom and gives model developers clear signals about where their models are weak and how to improve them.
- Focus on Realistic Interactions: The benchmark moves beyond simple question-answering to evaluate AI capabilities in more complex, realistic health conversations, assessing aspects like context-seeking, response depth, expertise-tailored communication, and handling uncertainty.
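
To make the rubric-based grading described above more concrete, here is a minimal Python sketch of how weighted criteria could be turned into a score. It is an illustration under stated assumptions, not the benchmark's actual implementation: the `Criterion` class, the `rubric_score` and `overall_score` functions, the use of negative points for undesirable behavior, the normalization by the maximum positive points, and the clipping of the final mean to [0, 1] are all assumptions made for this example. The `met` flag stands in for the verdict that a model-based grader would return for each criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, for illustration only)."""
    text: str
    points: int  # assumed: positive = desirable behavior, negative = undesirable
    met: bool    # stand-in for the model-based grader's verdict


def rubric_score(criteria: List[Criterion]) -> float:
    """Score one response: points earned divided by maximum achievable points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points  # can be negative if penalties dominate


def overall_score(per_example_scores: List[float]) -> float:
    """Aggregate: mean of per-example scores, clipped to [0, 1] (assumed)."""
    if not per_example_scores:
        raise ValueError("no scores to aggregate")
    mean = sum(per_example_scores) / len(per_example_scores)
    return min(1.0, max(0.0, mean))


example = [
    Criterion("Advises seeking emergency care for chest pain", points=5, met=True),
    Criterion("Asks about symptom duration", points=3, met=False),
    Criterion("Recommends a specific prescription dose unprompted", points=-2, met=True),
]
score = rubric_score(example)  # (5 - 2) / (5 + 3)
```

In this made-up rubric the response earns 5 points, misses 3, and loses 2 for an undesirable behavior, so it scores 3 out of a possible 8. Keeping the per-criterion verdicts separate from the arithmetic means the grader can be swapped out (a model, a human reviewer, or a fixed test fixture) without touching the scoring logic.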