Capabilities

OpenAI opens free clinician workspace, claims GPT-5.4 beats physician baselines on its own new benchmark

ChatGPT for Clinicians launches free for verified U.S. physicians, NPs, PAs and pharmacists, with cited search and CME credits. It arrives alongside HealthBench Professional, an open benchmark on which OpenAI says its GPT-5.4 workspace outscores human physician responses.

By Abigail Pemberton · Capabilities & evals · June 18, 2026

OpenAI on Wednesday launched ChatGPT for Clinicians, a free GPT-5.4-powered workspace open to verified U.S. physicians, nurse practitioners, physician assistants, and pharmacists, alongside a new open benchmark it says proves the workspace outperforms not only every other model tested but human physicians themselves.

The free tier is the headline, but the more revealing move is methodological. Verification runs through National Provider Identifier checks at signup, per MobiHealthNews, with HIPAA support and training-data exclusion bundled in. The product sits downstream of ChatGPT for Healthcare, the enterprise workspace OpenAI rolled out earlier this month to systems including Boston Children’s Hospital, Cedars-Sinai, Stanford Medicine Children’s Health, HCA Healthcare, and Memorial Sloan Kettering. Features include cited search across peer-reviewed literature, reusable workflow “skills,” deep research, and CME credits, a stack engineered to make ChatGPT legible to hospital procurement and credentialing offices rather than just curious residents.

The benchmark is where the narrative management gets interesting. HealthBench Professional, which OpenAI is releasing openly, splits into three task categories: care consult, writing and documentation, and medical research. Roughly one third of examples involved physicians red-teaming the models, and the dataset was filtered to be about 3.5x harder than typical clinician chats. Human physicians answering with full web access form the baseline. OpenAI says GPT-5.4 inside the clinician workspace beats base GPT-5.4, every external model tested, and that physician baseline.

It’s a benchmark designed by the company, scored by the company, won by the company. That isn’t disqualifying, but it’s the structural pattern worth naming.

The supporting numbers are large. Pre-release testing covered 6,924 conversations, with 99.6% of responses rated safe and accurate, and a 355-example subset where three independent physicians specified ground-truth citations. Ongoing physician review now spans more than 700,000 responses, with a new one reviewed “every few minutes.” Earlier evaluation work drew on 260 clinicians and 600,000 reviewed outputs.

Peter Bonis, chief medical officer at Wolters Kluwer Health, told Fierce Healthcare that such evaluation work “may not be nearly sufficient to anticipate and correct the spectrum of gross and subtle errors that can occur with AI.” There is still no agreed standard for clinical validity in generative systems, which is precisely the vacuum HealthBench Professional walks into.

The install base is already there. A 2026 AMA survey cited by OpenAI puts physician AI use in clinical practice at 72%, up from 48% the prior year; Fierce Healthcare reads the same data as 81%. Either way, adoption ran ahead of benchmarking, and the vendor with the largest deployment is now also writing the test it grades itself on.

Sources