HealthBench: OpenAI’s Medical AI Benchmark Scores Explained — and What They Mean for Clinical AI
OpenAI describes HealthBench as “a new benchmark designed to better measure capabilities of AI systems for health.” It scores model responses against a set of more than 48,000 physician-written criteria tailored to each conversation. Conversations fall into one of seven categories HealthBench has defined, from emergency referrals and health data tasks to asking for context or identifying uncertainty. Each criterion is further graded on factors such as accuracy, clarity and completeness, including next-best action recommendations.
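The rubric-style grading described above can be illustrated with a small sketch. This is a hypothetical simplification, not OpenAI’s actual implementation: it assumes each conversation carries physician-written criteria with point values (negative values penalizing unsafe behavior) and computes a score as earned points over the maximum possible positive points.

```python
# Hypothetical sketch of rubric-based scoring in the style HealthBench
# describes. Criteria names and point values here are fabricated examples.

def rubric_score(criteria: dict, met: set) -> float:
    """criteria: criterion name -> point value (negatives penalize);
    met: names of criteria the model response satisfied.
    Returns a score in [0, 1]: earned points / max positive points."""
    max_points = sum(p for p in criteria.values() if p > 0)
    if max_points == 0:
        return 0.0
    earned = sum(p for name, p in criteria.items() if name in met)
    return max(0.0, min(1.0, earned / max_points))

criteria = {
    "recommends emergency referral": 5,
    "asks about symptom duration": 3,
    "uses clear lay language": 2,
    "gives definitive diagnosis without enough context": -4,  # penalty
}
met = {"recommends emergency referral", "uses clear lay language"}
print(rubric_score(criteria, met))  # 7 of 10 possible points -> 0.7
```

In a real benchmark the “met” set would come from a grader (human or model) judging each response, which is where the independent papers’ concerns about alignment with physician ratings apply.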
In a research paper accompanying the HealthBench release, OpenAI reports “steady initial progress … and more rapid recent improvements” in model performance and safety.
Independent research has been more mixed. One paper says HealthBench “is reliable and aligns well with physician ratings” but notes that it lacks “real-time clinical interaction assessments or measurement of downstream clinical outcomes.” A second paper describes HealthBench as a “significant advancement in medical AI benchmarking” but notes an underrepresentation of rare diseases and an inability to assess longitudinal workflows, “limiting insights into AI’s impact across the complete care continuum.”
Ghane says it’s important to remember that benchmarks such as HealthBench aren’t direct substitutes for real-world evidence. “Scores reflect performance in simulated environments and should be interpreted alongside real-world, local testing, workflow integration and safety,” she says. “Health systems should not rely entirely on benchmarks for deployment decisions; they should be one of many metrics used to inform AI procurement.”
Enterprise Deployment Considerations: Claude, Gemini and OpenAI
Meanwhile, in recent months, each of the major LLM players has released a set of AI-powered products for hospitals and health systems. Each offering is a bit different, and it’s important for organizations to understand this nuance as they evaluate enterprise-grade AI tools. “What matters most is how a solution performs on your unique patients, context of use, data and workflows,” Ghane says.
Claude for Healthcare. Claude can pull from “industry-standard systems and databases” as well as the National Provider Identifier Registry, the ICD-10 code base and coverage determination databases. Organizations can deploy AI agents for prior authorization and Fast Healthcare Interoperability Resources data exchange, opening the door to automating a range of administrative processes.
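For readers unfamiliar with FHIR, the data exchange mentioned above moves JSON “resources” wrapped in a Bundle envelope. The snippet below is a minimal, hypothetical sketch of consuming such a payload; the sample bundle is fabricated, and a real integration would pull bundles from the health system’s own authenticated FHIR endpoint.

```python
# Hypothetical sketch: extracting patient names from a FHIR Bundle.
# The bundle below is a fabricated example following the FHIR R4 shape.
bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "example",
                      "name": [{"family": "Doe", "given": ["Jane"]}]}},
    ],
}

def patient_names(bundle: dict) -> list:
    """Collect display names from Patient resources in a FHIR Bundle."""
    names = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") == "Patient":
            for name in resource.get("name", []):
                parts = name.get("given", []) + [name.get("family", "")]
                names.append(" ".join(parts).strip())
    return names

print(patient_names(bundle))  # ['Jane Doe']
```

An AI agent handling prior authorization would traverse richer resources (Claim, Coverage, Condition) the same way, which is why structured FHIR access matters for automating these workflows.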
Gemini 3.0. Aashima Gupta, global director of healthcare for Google Cloud, suggests in a LinkedIn post that Gemini’s differentiator is multimodality, or the ability to bring together “text, voice, images, waveforms, scans, genomics data, clinical guidelines, and operational data.” This can be used to support next-best action recommendations. Gemini 3.0 also includes AI agents for automating workflows across business applications.
