Journalists adore a certain kind of medical story: a quiet hospital hallway, a baffled physician, a last-minute rescue. A new Science paper out of Boston inverts that scene. This time, while the human team is still adjusting the anticoagulation drip, a model sitting somewhere in an OpenAI data center reads through clumsy electronic health records and silently arrives at lupus.
It’s worth reading twice, in part because Dylan Scott’s Vox piece doesn’t oversell it. The numbers are striking on their own. OpenAI’s o1 model correctly diagnosed patients at triage 67% of the time; the two physicians tested against it scored 50% and 55%. By the admission stage, the model had climbed to 81%. To their credit, the doctors finished at 70% and 79%, closing much of the gap. Still, the headline writes itself, and plenty of outlets wrote it.
| Item | Detail |
|---|---|
| Study Title | Reasoning model performance in emergency department diagnosis |
| Published In | Science, April 30, 2026 |
| Lead Institutions | Harvard Medical School; Beth Israel Deaconess Medical Center |
| Co-Author Quoted | Dr. Adam Rodman, general internist and medical educator |
| AI Model Tested | OpenAI’s o1 reasoning model (released 2024) |
| Triage Accuracy | AI: 67% — Doctors: 50% and 55% |
| Admission-Stage Accuracy | AI: 81% — Doctors: 70% and 79% |
| Data Source | Real ER cases from Beth Israel Deaconess |
| Reporter | Dylan Scott, health correspondent at Vox |
| Key Caution | Authors warn against using results to replace physicians |
| Earlier Counter-Study | Nature Medicine, Feb 2026 — ChatGPT underestimated severity in 52% of cases |
Every thorough article hides its catch in the second half, and Scott seems particularly interested in this one. One of the co-authors, Dr. Adam Rodman, told reporters he feels “a little bit queasy” about how the findings might be applied. That is not a throwaway line. A researcher watching his own paper leave the lab is worried about what hospital administrators, insurance executives, and tech investors will do with it. Reading between the lines, he has already seen the PowerPoint slide it will eventually become.
In reality, the study measured diagnostic reasoning on paper. Cold text. No bedside chatter, no patient recoiling when a resident presses on the lower right quadrant, no parent quietly mentioning that their child hasn’t eaten in two days. The model never saw any of the human signal that has always mattered to emergency medicine as much as labs and imaging. It saw the chart. It performed remarkably well on the chart. Whether it would perform nearly as well on a real shift, with interruptions and incomplete information arriving in waves, is genuinely unclear.
The timing is also hard to ignore. A separate Nature Medicine paper in February found that ChatGPT, a generalist model rather than a reasoning model, underestimated patient severity in more than half of test scenarios. In one instance, it advised someone on the verge of diabetic shock to monitor things at home. That study barely registered. The April paper, with its prestigious co-authors and cleaner numbers, has circulated widely.

Watching this unfold, the cultural pattern feels familiar. Self-driving cars aced their closed-course tests for years before anyone trusted them on a foggy freeway. AI was supposed to replace radiologists by 2020; six years past that deadline, radiologists are still reading scans, often with software assisting them in the background. This will most likely take the same shape. Not a replacement. A second pair of eyes that never tires, never drops lupus from the differential, and never, ever has to explain things to the family in the waiting room.
The authors have explicitly called for clinical trials before any of this is applied to actual patients. The more intriguing question is whether anyone listens.