The Harvard study contains a moment that is difficult to forget. A patient with a blood clot in their lungs shows up at the emergency department. Things get better at first, but then they don’t. Based on years of experience, training, and standing in fluorescent-lit hallways reading vital signs at odd hours, the medical staff has a reasonable suspicion that the medication has stopped working. The AI then intervenes. It looks through the electronic health record, identifies a history of lupus, and speculates that inflammation rather than a failing medication may be the true cause. As it happens, it is correct. The medical professionals had overlooked it.
That one case, part of a larger study published in the journal Science, captures something both fascinating and subtly unsettling about the direction of medicine. In 67% of emergency triage cases analyzed at Boston’s Beth Israel Deaconess Medical Center, OpenAI’s o1 reasoning model found the correct or a closely related diagnosis. The two seasoned medical professionals it was tested against? They scored between 50% and 55%. That gap is not small. It is significant, particularly in an emergency room where a mistaken call can cost a life.

The Harvard Medical School researchers were forthright in their findings. They stated that the majority of clinical reasoning benchmarks have been surpassed by large language models. It’s the kind of language that gets lost in the cacophony of AI hype cycles, but it’s worth reading more slowly. These weren’t textbook cases or simulated patients designed to appease an algorithm. These were actual patients who showed up at a real Boston hospital with actual, disorganized, and incomplete records—exactly the kind of disjointed data that makes emergency medicine so incredibly challenging.
What happened when additional data became available makes the findings more startling still. The AI’s diagnostic accuracy rose to 82%. The human doctors scored between 70% and 79%: impressive, but still behind. In a separate study arm that involved treatment planning across five comprehensive clinical cases, the AI scored 89%, compared with 34% for 46 doctors using traditional resources. That final figure ought to spark more public discussion than it has.
One of the study’s lead authors, Dr. Adam Rodman, has been cautious in how he presents all of this. He thinks that rather than taking the place of doctors, AI will work alongside them in what he refers to as a “triadic care model” in which the patient, the doctor, and an AI collaborate. That is a well-considered vision. It is also, for now, somewhat aspirational: the framework required to actually deploy AI this way, safely, responsibly, and fairly, does not yet exist. Rodman admitted that there isn’t currently a formal accountability system in place. That admission deserves more attention than it has received.
What the study did not test is how the AI performs with elderly patients, non-English speakers, or anyone whose medical history does not translate well into structured text. Nor did it measure what happens when a clinician unintentionally defers to the AI’s answer rather than thinking for themselves, a dynamic that University of Sheffield researchers identified as a genuine and growing concern. The idea of a tool intended to improve diagnostic thinking gradually undermining the habit of independent thought is unsettling.
Nearly one in five American doctors already uses AI to help with diagnosis, and that figure is increasing. Adoption is outpacing the construction of safeguards, a pattern America has seen with earlier technologies and seldom managed with grace. The Harvard results are truly remarkable. They imply that something genuine is taking place in clinical AI, not a staged demonstration or a showpiece. However, impressive results and deployment-ready safety are not the same thing, and it’s important to make that clear before the distinction between them quietly blurs.

