There are currently two things going on simultaneously in practically every large teaching hospital in Europe. In between rounds, doctors are discreetly experimenting with chatbots, occasionally pasting symptoms into a window on a phone hidden behind a clipboard. Somewhere down the hallway, an administrator is staring at a procurement contract for an AI diagnostic tool that no one in the building fully understands. Those closest to patients have begun to state publicly that the technology has advanced more quickly than the rulebook.
This week, Flinders University researchers made that concern public by reporting in Science that sophisticated reasoning models are now matching, and occasionally outperforming, seasoned medical professionals on text-based diagnostic cases. In 88.6% of published clinicopathological cases, one model, OpenAI’s o1-preview, produced accurate or nearly accurate diagnoses. With a triage accuracy of 67.1% in actual emergency department scenarios, it outperformed two attending physicians. Amazing figures. The kind of figures that, in the same meeting, cause hospital boards to feel both anxious and excited.
However, one of the writers, Erik Cornelisse, made a statement that resonated with me. A system is not safe for patients just because it is accurate, especially when it comes to text-only cases. It’s a straightforward statement, obvious once you read it, but it contradicts the way these tools are being promoted. A medical encounter is not a paragraph that has been typed. At three in the morning, a weary nurse observes a change in a patient’s breathing. When someone describes pain, there is a slight hesitancy in their voice. None of that is contained within a prompt window.
Speaking with experts in the field gives the impression that the gap between what AI can accomplish in a lab and what it ought to accomplish in a clinic is growing rather than narrowing. The Flinders commentary highlights historical instances that are unpleasant to revisit. Algorithms subtly made racial disparities in healthcare spending worse. In independent testing, consumer-facing triage tools failed to identify over half of actual emergencies. Launched in January 2026 as a personalized information service, ChatGPT Health ultimately under-triaged the majority of the urgent cases it came across. It wasn’t designed for that. People used it for that anyway. They always will.

The discomfort is exacerbated by the timing of the World Health Organization’s European report. The largest barrier to the safe adoption of AI, according to 86% of the 50 member states surveyed, is legal uncertainty. The regional director, Hans Kluge, called the situation what it is: a sharp increase without fundamental legal safeguards. The same week, in an almost theatrical manner, the European Commission unveiled its Digital Omnibus package, which included a proposal to relax parts of the GDPR. More than 120 civil society organizations have already dubbed it the worst rollback of digital rights in EU history. European Digital Rights’ Ella Jakubowska put it more simply. She cautioned that under an ambiguous “legitimate interest” clause, our DNA might wind up training the AI systems of big businesses.
The contradiction is difficult to ignore. One branch of European governance is pleading for stricter regulations. Another is filing them down. In the meantime, the models continue to advance weekly and are now multimodal, consuming scans, images, and audio. The capabilities of GPT-5.3 and Gemini 3.1 Pro surpass those of tools from the previous year.
The senior author of the Flinders article, Associate Professor Ash Hopkins, made a statement that ought to be posted on the wall of every health ministry. Physicians are not permitted to practice without oversight and assessment. AI ought to be held to similar standards. For now, it’s unclear if anyone in Geneva or Brussels is paying enough attention.

