Somewhere within Boston’s Beth Israel Deaconess Medical Center, an OpenAI reasoning model was silently reading the same disorganized, unfinished electronic health records that two seasoned emergency physicians had in front of them. Seventy-six actual patients. Actual triage notes. Actual vital signs, scrawled by nurses in the midst of a chaotic emergency room. According to the results, published this spring in the journal Science, the AI reached the correct diagnosis 67% of the time. The two doctors managed 50% and 55%, respectively.
At its largest, the gap was seventeen percentage points, which is not trivial. It’s the kind of figure that sends venture capital into health-tech startups and has hospital administrators scheduling meetings they hadn’t planned the quarter before. The finding made headlines all over the world. Social media took off. In just one day, Hashem Al-Ghaili’s post received hundreds of reactions. Every time a machine outperforms a human at something we believed required human intuition, a certain electricity runs through public discourse.

However, a detail that completely alters the conclusion was buried deep in most of the coverage, and sometimes not mentioned at all: the AI only read text. It never saw a patient’s face. It never detected the flicker of bewilderment in an elderly person’s eyes, never heard labored breathing, and never saw how someone grimaced while shifting on a gurney. The researchers themselves acknowledged that the model, which processed only electronic health records, functioned more like a clinician providing a second opinion based on paperwork than one standing at a patient’s bedside.
This matters a great deal, and it’s odd how quickly it gets forgotten. Emergency medicine is built on multisensory overload. In the emergency room, doctors can distinguish between a panic attack and a pulmonary embolism by observing gait, listening to breath sounds, and reading skin tone. Reducing the comparison to “AI versus doctors” without acknowledging that the test was limited to text-only inputs would be akin to comparing a chess engine to a grandmaster who is also taking phone calls.
Still, one of the study’s cases is truly remarkable. A patient arrived with a blood clot in the lungs and worsening symptoms. The treating physicians suspected that the anticoagulants were not working. When the AI scanned the records, it found something the humans had missed: a history of lupus that might be contributing to inflammation of the heart. It was correct. Such moments are difficult to ignore, and they suggest a real, useful future in which AI serves, as one commentator put it, as “a second set of eyes,” catching the zebras that weary doctors miss at three in the morning.
Lead author and Beth Israel clinician Adam Rodman framed the findings thoughtfully. He described a coming “triadic care model” in which the patient, the physician, and an AI system collaborate. He emphasized that patients still want a human to help them make decisions that could mean the difference between life and death. That framing may be perfectly appropriate, but it may also be optimistic packaging for a technology whose integration into real clinical workflows is still largely unexplored. David Reich, chief clinical officer at Mount Sinai, posed the most important question: how do you actually incorporate this into care in ways that improve outcomes?
There are darker undercurrents as well. Wei Xing of the University of Sheffield raised the issue of automation bias, in which medical professionals unintentionally defer to an AI’s recommendation rather than using their own judgment. Whether the model performs worse with elderly patients, non-English speakers, or rare presentations is still unknown. Rodman acknowledged that there is currently no accountability framework. Nearly one in five American doctors already uses AI for diagnostic purposes, outpacing any regulatory framework intended to catch mistakes.
There’s a familiar tension in watching this play out. The technology is truly impressive, capable of doing something that seemed unattainable five years ago. Nevertheless, sixty-seven percent means that roughly one in three patients would have received an inaccurate or incomplete diagnosis from the machine, even though it beat the humans in this specific test. No one should be so reassured by that figure that they begin replacing the people in scrubs. The real story is not that AI outperformed doctors. It is that the test was far more limited than the job, and no one has figured out what happens when the machine encounters the complete, perplexing complexity of a human body in crisis.

