In most emergency rooms there is a moment, between the intake form and the first blood draw, when a nurse looks at a patient and decides how soon a doctor needs to see them. That assessment, built on years of witnessing suffering, has always felt fundamentally human. A recent Harvard study makes that assumption harder to maintain.
The study, published in the journal Science in late April, found that OpenAI’s o1 reasoning model outperformed emergency room doctors at clinical diagnostic reasoning, not on a sanitized multiple-choice test but under conditions meant to replicate the actual chaos of triage. It used 76 real patients and real electronic medical records, and the model received the same intake information a nurse would give a doctor in a Boston hospital. In 67% of cases, the AI made the right or an extremely close diagnosis. Human physicians managed 50 to 55 percent.

It’s hard to ignore those figures. It turns out they are also hard to trust completely, not because the study is flawed but because of what it was, by design, unable to measure.
The model relied solely on text: vital signs, demographic information, and a nurse’s brief note on why the patient came in. All of it was neatly organized and typed. The model could not see a patient’s face turn pale in the middle of a sentence. It could not pick up the slight tremor in a hand that a doctor might notice while taking a pulse. It could not detect the particular kind of agitation that sometimes signals a neurological problem rather than a psychological one. Medicine lived in the body long before it lived in the chart, and that still matters.
What makes the study noteworthy, and has undoubtedly alarmed some researchers, is the AI’s performance at the earliest, least-informed stage of triage. Most people would expect AI to do better, relative to doctors, as it gets more information. Instead, although its accuracy rose to 82% as more information became available, the gap with human physicians shrank to insignificance. The machine had its sharpest edge early on, when data was scarce and uncertainty was at its highest. That is surprising. It’s possible that the AI’s advantage stems from the fact that it isn’t overwhelmed by uncertainty the way a weary human physician might be at three in the morning on a Tuesday.
The case studies involving rare illnesses were especially instructive. These cases, published in the New England Journal of Medicine, are the kind of diagnostic puzzles that have challenged physicians and medical students since the 1950s. The AI’s performance on them, according to senior co-author Arjun Manrai, “shocked a lot of folks.” It is easy to understand why. Rare diseases are where human pattern recognition tends to fail, and where systems trained on massive amounts of medical literature may quietly excel.
One version of this story already feeds a particular kind of techno-optimism, portraying AI as an unbiased, tireless diagnostic partner that quietly saves lives in underfunded emergency rooms or in remote hospitals where access to specialists is a luxury. That version is worth imagining. Slowing down is also worthwhile.
Nearly one in five American doctors already uses AI to help with diagnosis. The tools are arriving faster than the frameworks meant to govern them. Legally and morally, it is still unclear who is responsible for AI errors in clinical settings. The study’s authors also took care to note that patients want people to help them through their most difficult times. A diagnosis is still just a number if it is given without a helping hand.
Watching the medical community take in this study gives me the impression that something truly significant has changed. The change is not that AI has taken the place of doctors, but that we can no longer claim it wasn’t qualified to try. What happens next depends less on the technology itself than on whether medicine has the institutional will to decide where the machine belongs and where it doesn’t.

