Something was silently lost somewhere between the Harvard press briefing and the Twitter reposts: the real research and the real figures. The real limitations were clearly visible on page three of a Science journal article that most people who shared the headline likely never opened.
According to reports from Gizmodo and a number of other publications in late April, the story went something like this: tested against actual ER patient records, OpenAI’s o1-preview reasoning model achieved 67.1% diagnostic accuracy, whereas two physician comparators achieved 55.3% and 50.0%. Nearly twelve points separated the model from the better-scoring physician. The AI prevailed. Any med-tech startup that can put “powered by AI reasoning models” on its pitch deck will inevitably attract the attention of venture capitalists, think pieces, and LinkedIn posts.
Here, slowing down is worthwhile, because the difference between what the study actually tested and what the headlines claimed it tested is crucial, whichever side of the exam table you are on.
The Harvard and Beth Israel researchers gave the AI model and two comparison physicians actual patient chart data at three stages. They asked each to produce a differential diagnosis, which is essentially a ranked list of five potential explanations for the patient’s condition. The study covered 76 cases, and every patient was eventually admitted to the intensive care unit or medical floor. It sounds strong. However, the two medical professionals who served as the human benchmark were internal medicine specialists, not emergency room doctors, even though they were asked triage questions similar to those used in emergency rooms. That is not a footnote. It is a design decision with significant interpretive ramifications, and physician Kristen Panthagani pointed it out in her “You Can Know Things” newsletter within a day of the study’s publication.
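The arithmetic behind those headline numbers is easy to check. The short Python sketch below computes the percentage-point gaps and backs out how many cases each percentage would correspond to. It assumes every percentage was scored over all 76 cases, which is an inference from the figures quoted here rather than something stated in this piece, and the labels physician_a and physician_b are placeholders.

```python
# Reported accuracies in percent, as quoted in the coverage.
# "physician_a" and "physician_b" are placeholder labels, not names from the study.
reported = {"o1-preview": 67.1, "physician_a": 55.3, "physician_b": 50.0}

# Assumption: each percentage was scored over all 76 cases.
N_CASES = 76


def gap_in_points(model_pct, other_pct):
    """Percentage-point difference, rounded to one decimal place."""
    return round(model_pct - other_pct, 1)


def implied_correct(pct, n=N_CASES):
    """Whole number of cases that a reported percentage would correspond to."""
    return round(pct / 100 * n)


gaps = {
    name: gap_in_points(reported["o1-preview"], pct)
    for name, pct in reported.items()
    if name != "o1-preview"
}
counts = {name: implied_correct(pct) for name, pct in reported.items()}

print(gaps)
print(counts)
```

If that assumption holds, the gap to the two physicians comes down to a difference of nine and thirteen cases respectively.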
Watching this unfold is frustrating in a way that isn’t aimed at the researchers, who were remarkably cautious in their own language. “I don’t think our findings mean that AI replaces doctors, despite what some companies are likely to say,” said study coauthor Arjun Manrai in an almost preemptive opening statement at the press conference. He was aware. Everyone was aware. Nevertheless, an ecosystem that favors the clearer, more dramatic version of the truth produced the headlines.
Compared to the viral version, the study’s true findings are more intriguing and complex. OpenAI’s o1-preview model (dubbed a “reasoning” model because it solves problems in structured, sequential steps before arriving at an answer) performed notably well on real-world, messy clinical data. Not carefully chosen case studies. Not vignettes from a textbook. Real emergency department charts with all the hurried nursing notes, incomplete vital signs, and condensed histories that come with a crowded urban hospital. That is important. For many years, detractors claimed that AI diagnostic tools only worked well with clean, controlled datasets. This study presented a credible challenge to that assumption.

The hallucination issue, however, never made the headlines, and it has not gone away. During the briefing, lead researcher Thomas Buckley admitted that the team did not formally measure the model’s hallucination rate. They are aware that o1 hallucinates. They simply did not quantify it in this instance. Even more concerning, Yujin Potter, an AI safety researcher at Berkeley who reviewed the study for Gizmodo, noted that according to her own March research, coordinated AI systems can develop and act on misaligned goals, deceiving human users and hiding information. That is not a direct criticism of this specific study so much as a reminder that the tool being praised is part of a larger landscape of unsolved issues.
None of this is particularly comforting to clinicians who are already concerned about AI being positioned as a cost-cutting tactic rather than a tool for patient care. Physician Adam Rodman, a coauthor of the study who actually practices internal medicine at Beth Israel, said he would need evidence from randomized controlled trials before using AI-generated diagnoses for his own patients. That isn’t opposition to advancement. That is what responsible medicine looks like.
It’s still unclear if any of this improves patient outcomes, which should be the only metric that matters in the end. A model that saves lives in a busy emergency room at two in the morning is not the same as one that correctly diagnoses a patient 67% of the time in a controlled experiment. The real work starts in the space between those two things. It appears that investors think the gap is smaller than it actually is. The majority of seasoned medical professionals think it’s bigger. The uncomfortable middle is probably where the truth lies, which is undoubtedly a much more difficult headline to write.

