The first thing worth noticing about the recent Beth Israel study is how quickly its conclusions were condensed into a single sentence: AI outperformed physicians. That was the line that traveled. It ran through Instagram reels, Substack newsletters, LinkedIn posts from doctors with lengthy resumes, and a Vox article that landed somewhere between breathless and cautious. Reading the coverage, I keep getting the impression that we've all missed the important part.
The study, published in Science on April 30, tested OpenAI's o1-preview against two doctors using real emergency room records from 76 Beth Israel patients. Not carefully curated case studies from a journal. The kind of messy charts written under fluorescent lights at two in the morning, between a chest pain workup and an inebriated patient who won't leave. Across diagnostic reasoning, triage, and case management, the AI matched or outperformed the physicians. Adam Rodman, one of the co-authors, told NPR that what most impressed him was the model's ability to handle the chaos of real ER data. That is a significant finding. It is also not the same as declaring that machines are ready to take over.
| Item | Details |
|---|---|
| Study Title | Superhuman performance of a large language model on the reasoning tasks of a physician |
| Published In | Science |
| Publication Date | April 30, 2026 |
| AI Model Tested | OpenAI’s o1-preview |
| Lead Institution | Beth Israel Deaconess Medical Center, Harvard Medical School |
| Co-Lead Researcher | Adam Rodman, clinical researcher |
| Collaborating Institution | Stanford University |
| Patient Records Used | 76 real ER patients across three stages of care |
| Physician Diagnostic Accuracy | 50% and 55% (two doctors) |
| Key Finding | AI alone outperformed both unassisted physicians and physicians using AI assistance |
| Coverage Sources | Smithsonian Magazine, Harvard Magazine, Vox, NPR |
Buried inside the paper is a quieter result that the headline barely touches. Physicians who used the AI as an assistant did no better than physicians working alone. Both groups were beaten by the AI on its own. As Ed Kalpas pointed out in a LinkedIn post, this deserved far more attention than it got. It's not a replacement story if the tool only helps when people get out of its way. It's a story about workflow, ego, interface design, and the peculiar new question of whether a doctor should defer to a model she did not train and cannot fully audit.
Then there is the small matter of sample size. Two physicians. From the same hospital. Tested against the AI on 76 cases. A cardiologist on Instagram, @yourheartdoc, raised this within days of the article's release. Two doctors are not the medical profession. They are two people having a particularly hard week, set against a system that doesn't get tired, doesn't have a sick child at home, and never has to walk into the next room and tell a family that their grandmother isn't going to make it.
That last part is what keeps getting lost. Diagnosis is one piece of what doctors do. A radiologist in Cleveland once told me she spends roughly a third of her day actually reading images. The rest is conversations, judgment calls, deciding which test is worth the radiation, and calling a referring physician to chase down a suspicion. The 2026 PMC study found that AI excels at narrow tasks like lesion measurement and image interpretation, which is roughly where everyone serious about this has been pointing for years.

It’s difficult to ignore the pattern. The discourse returns to replacement every few months when a new study demonstrates that an AI model outperforms clinicians on a bounded task. Geoffrey Hinton predicted in 2016 that radiologists would become obsolete by 2021. Today, more radiologists are employed than there were back then. The technology advanced. Around it, the nature of the job changed. Emergency medicine is likely to experience something similar, albeit more slowly than the headlines indicate and more quickly than the institutions are prepared for.
What the o1 study really demonstrates is how much it matters to separate diagnostic accuracy from clinical practice. Confuse the two and policy decisions, hospital budgets, and patient expectations all get built on a misconception. Keep them apart and you get something more interesting: a tool that can catch the rare diagnosis a weary resident missed at four in the morning, placed in the hands of a doctor who still has to walk into the room.

