I can’t stop thinking about a scene from the Harvard study. A patient with a blood clot in their lungs shows up at an emergency room in Boston. The symptoms worsen. As they go through the chart, the doctors conclude that the anticoagulants are just not working. This is a reasonable assumption that any skilled doctor would make during a busy shift. However, the AI perceives something different when it scans the same record. The patient suffers from lupus. That could be the cause of the lung inflammation rather than a drug that isn’t working. The AI was correct.
It’s a tiny moment in a big study, but it sums up the anxiety the paper has caused ever since it was published in Science at the end of April. In a series of clinical tasks, researchers from Harvard Medical School, Beth Israel Deaconess, and Stanford tested OpenAI’s o1 reasoning model against hundreds of physicians. On the headline measure, 76 actual emergency room cases in Boston, the AI gave the correct or nearly correct diagnosis 67% of the time. The doctors managed between 50% and 55%. The disparity grew uncomfortably wide when longer-term treatment plans were tested: 89% for the model and 34% for doctors who used search engines and standard reference materials.

These figures have spread quickly. The Guardian described the trial as “groundbreaking.” Vox flagged the catch. A more skeptical article in the BMJ argued that the framing has outpaced the evidence. And it has: the study’s actual findings are more limited than the coverage implies, and the most important aspects are the ones it omits.
Think about the setup. The AI was reading vital signs, electronic health records, and a few sentences from a triage nurse. It never saw a patient. It missed the wheeze, the grey skin, and the way a person shifts their weight when their abdomen aches in a specific spot. Emergency medicine is famously a sensory field. A competent attending makes half of a diagnosis on entering the room, before anyone speaks. Remove that and you are not testing doctoring. You are testing chart review. Arjun Manrai, one of the main authors, has been careful on this point. He has publicly stated that the findings do not imply that AI will take the place of doctors. His co-author at Beth Israel Deaconess, Adam Rodman, talks instead about a “triadic care model” (doctor, patient, machine), which sounds less like a revolution than a quick, sensible second opinion.
Still, the study leaves three questions unanswered. The first is whether the model works well outside Boston, on populations and presentations it wasn’t subtly designed for. The second is what happens when doctors and AI actually collaborate instead of competing on paper: does the diagnosis get better or only feel more confident, does the machine anchor the team on an incorrect response, or does the human defer too quickly? The third is the awkward one. When the AI makes a mistake, who is responsible? The machine’s accuracy rate of 67% is remarkable next to a human baseline of 50 to 55 percent, but it still fails a third of the time. A third is a lot of people in an emergency room.
Something truly remarkable is taking place here, the kind of result that, even when discounted, suggests the technology has moved beyond what it could previously do. According to recent surveys, about one in five US doctors are already using AI for diagnostic assistance, which suggests clinicians have noticed too. Whether that is progress or premature is a question the trials Manrai is calling for will have to answer. The paper is not a conclusion, but a starting point.

