Between the initial triage note and the attending physician’s first actual examination of a patient, there is a stretch in emergency medicine when the diagnosis could still go in a number of directions. The information available is disorganized, incomplete, and occasionally contradictory. The chart is a mess of inconsistent vitals, nurse observations, and timestamps. Navigating that maze has long been thought to be the sole responsibility of qualified human doctors. That might no longer be the case.
According to a study published in Science this past April, a large language model from the OpenAI o1 series outperformed doctors across a startling range of clinical reasoning tasks, including actual emergency department cases taken straight from electronic health records without any cleaning or pre-processing. The baseline consisted of hundreds of clinicians, and the AI outperformed most of them. As that result lands in the medical community, there is a sense that something has truly changed, even though no one is quite sure what to do with it yet.

It wasn’t just the result that set the Harvard Medical School and Beth Israel Deaconess research apart. It was the setup. Previous AI medical research has frequently come under fire for providing models with idealized data, such as neat symptom lists and polished case summaries. This team purposefully avoided that. Patient data was fed into the model exactly as it appeared in the medical record. Incomplete. Unclear. Real. Even at the earliest stages of decision-making, when clinical data was at its most limited, the model still matched or surpassed attending physicians.
In the words of co-first author Peter Brodeur, “the old benchmarks are broken.” AI systems are scoring close to the ceiling, making multiple-choice medical exams, which were once thought to be a reasonable proxy for clinical competence, practically useless for assessing these models. The field has been using an excessively short ruler to measure progress.
Impressive as that performance was, it makes sense for the celebration to end there. The researchers are clear, almost insistent, that outperforming a doctor on a reasoning task does not equate to being prepared to practice medicine. A model might accurately determine the most likely diagnosis while also suggesting a number of unnecessary tests that could actually endanger a patient. There is currently no trustworthy method for measuring, at scale, the gap between clinical accuracy and clinical safety.
Whether any health system is truly ready for what a rigorous clinical trial of AI-assisted care would entail is still up for debate. Before any broader deployment, the researchers call for prospective trials, the same gold standard used for medications and surgical procedures. That framing is important. It treats AI as a medical intervention rather than a software update.
For years, the medical AI debate seems to have been running a little ahead of the evidence, making headlines more quickly than it produces controlled data. In some ways, this study is a course correction, not because it diminishes what the model achieved but because the researchers who are celebrating are also the ones applying the brakes. Adam Rodman acknowledged that he thought the experiment in the emergency room would fail. It didn’t. And that was what shocked him the most.
The machines’ ability to read the chart is improving dramatically. Whether that makes patients safer or merely improves the chart’s appearance on paper has not yet been established.

