Asking a machine a question about your own body and getting back a response that sounds like it was written by a senior doctor with thirty years on the wards is deeply unsettling. The sentences are clear. The tone is measured. The footnotes look authentic. And yet roughly half of those responses contain errors, according to a recent study published in BMJ Open.
A UCLA-led research team put five of the world's most popular chatbots through a stress test any patient would recognize. They posed fifty questions apiece about cancer, vaccines, stem cells, nutrition, and athletic performance to ChatGPT, Gemini, Grok, Meta AI, and DeepSeek, and two experts graded each response. Roughly 20% of the answers were deemed highly problematic, and half had some kind of issue. Out of 250 questions, only two drew an outright refusal, both from Meta AI, and both concerned genuinely risky subjects: anabolic steroids and experimental cancer treatments.
| Detail | Information |
|---|---|
| Study Title | Half of AI health answers are wrong even though they sound convincing |
| Published In | BMJ Open |
| Publication Date | April 2026 |
| Lead Institution | University of California, Los Angeles |
| Author of Analysis | Carsten Eickhoff, Professor of Medical Data Science, University of Tübingen |
| Chatbots Tested | ChatGPT, Gemini, Grok, Meta AI, DeepSeek |
| Total Questions | 250 (50 per chatbot) |
| Topics Covered | Cancer, vaccines, stem cells, nutrition, athletic performance |
| Problematic Responses | 49.6% problematic, 19.6% highly problematic |
| Worst Performer | Grok (58% flagged) |
| Best Performer | Gemini (fewest highly problematic responses) |
| Reference Accuracy | Median completeness score: 40% |
| Refusals to Answer | 2 out of 250 (both from Meta AI) |
The pattern is hard to ignore. Vaccines and cancer, the two fields where decades of clean, peer-reviewed research sit in well-organized databases, were where the chatbots performed best. Even there, about a quarter of the answers were problematic. The models faltered most on nutrition and athletic performance, areas already crowded with contradictory advice and internet rumor. Overall, Grok fared worst, with 58% of its responses flagged; ChatGPT came in at 52% and Meta AI at 50%. The differences between them are almost cosmetic.
The detail that should give any reader pause is what happened with citations. When asked to supply ten scientific references, the chatbots managed a median completeness score of just 40%. Across twenty-five attempts, not one produced a reference list that was entirely accurate. Some links were broken. Some authors were invented. Some papers, neatly formatted and footnoted, simply did not exist. A patient in a clinic waiting room has no real way to tell a fabricated citation from a genuine one; they look identical.
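Invented references are, at least in part, machine-checkable. As a rough illustration, not something from the study's own method, the sketch below looks a citation's DOI up against the public Crossref API: a missing record is a strong hint that the reference was fabricated, though a real record still does not prove the paper says what the chatbot claims. The DOI in the example is a placeholder.

```python
import requests  # third-party HTTP library: pip install requests

def check_doi(doi: str) -> None:
    """Look a DOI up in the public Crossref registry.

    A 404 means Crossref has no record of the DOI, which strongly
    suggests a fabricated citation. A hit only confirms the paper
    exists, not that it supports the claim it was attached to.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code == 404:
        print(f"{doi}: no Crossref record found")
        return
    resp.raise_for_status()
    msg = resp.json()["message"]
    title = msg.get("title", ["<no title>"])[0]
    year = msg.get("issued", {}).get("date-parts", [[None]])[0][0]
    print(f"{doi}: {title} ({year})")

# Placeholder DOI for illustration only.
check_doi("10.1000/example-doi")
```

Even a check this crude would have caught the nonexistent papers the study describes. What it cannot catch is the subtler failure: a genuine citation pasted under a claim the paper never makes.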

There is a straightforward reason chatbots behave this way, and it should be stated plainly: these models do not know anything. They predict the next likely word from patterns in their training data, which mixes peer-reviewed journals with Reddit threads, supplement blogs, and the chaotic sea of social media. They do not weigh evidence. They do not pause. They write fluently because that is what they were built to do, and in medicine, fluency is a kind of disguise.
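To see why fluency and accuracy come apart, here is a toy sketch of next-token sampling, the loop at the core of these models. The vocabulary and probabilities are invented for illustration; a real model runs the same step over tens of thousands of tokens and billions of learned weights, scoring likelihood rather than truth.

```python
import random

# Invented toy distribution: given a context, each candidate next word
# carries a probability learned from training text, not from evidence.
next_word_probs = {
    "vitamin C cures": {"colds": 0.55, "nothing": 0.25, "cancer": 0.20},
}

def predict_next(context: str) -> str:
    """Sample the next word in proportion to its learned probability.

    Note what is absent: no lookup of medical evidence, no concept of
    true or false. Statistical plausibility is the only thing scored.
    """
    probs = next_word_probs[context]
    words = list(probs)
    weights = [probs[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

print("vitamin C cures", predict_next("vitamin C cures"))
```

Roughly one run in five, this toy completes the sentence with "cancer", and nothing in the mechanism flags that continuation as different in kind from "colds".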
The researchers deliberately phrased prompts to push the models toward error, a technique known in AI safety circles as "red teaming." That likely inflates the failure rate relative to careful, neutral questions. But who carefully crafts their questions while sitting at home with an odd new lump or a worrying test result? The messy, open-ended question is exactly what most people type, because that is how people think, and it is exactly where the chatbots stumbled: 32% of responses to open-ended questions were judged highly problematic, against just 7% for closed questions of the tidy true-or-false kind.
A paper published in Nature Medicine in February 2026 found something even stranger. Tested on their own, the chatbots answered about 95% of medical questions correctly. But when actual people used those same chatbots, accuracy fell below 35%, no better than if they had not used a chatbot at all. In other words, the problem is not just the technology. It is us.
These tools are not going away, and they probably should not. They are genuinely useful for summarizing research, drafting questions before an appointment, and sketching out a topic before digging deeper. But the study argues, quietly and firmly, that they should never be treated as independent medical authorities. Check the claims. Examine the references. And when a chatbot sounds most certain, that may be exactly when to trust it least.