Asking a machine a question about your own body and getting back a response that sounds like it was written by a senior doctor with thirty years on the wards is deeply unsettling. The sentences are clear. The tone is measured. The footnotes look authentic. And yet roughly half of those responses contain errors, according to a recent study published in BMJ Open.
A UCLA-led research team put five of the world's most popular chatbots through a stress test any patient would recognize. They posed fifty questions apiece about cancer, vaccines, stem cells, nutrition, and athletic performance to ChatGPT, Gemini, Grok, Meta AI, and DeepSeek, and two experts graded each response. Roughly 20% of the answers were deemed highly problematic, and half had some kind of issue. Out of 250 questions, only two drew an outright refusal, both from Meta AI, and both concerned genuinely risky subjects: anabolic steroids and experimental cancer treatments.
| Detail | Information |
|---|---|
| Study Title | Half of AI health answers are wrong even though they sound convincing |
| Published In | BMJ Open |
| Publication Date | April 2026 |
| Lead Institution | University of California, Los Angeles |
| Author of Analysis | Carsten Eickhoff, Professor of Medical Data Science, University of Tübingen |
| Chatbots Tested | ChatGPT, Gemini, Grok, Meta AI, DeepSeek |
| Total Questions | 250 (50 per chatbot) |
| Topics Covered | Cancer, vaccines, stem cells, nutrition, athletic performance |
| Problematic Responses | 49.6% problematic, 19.6% highly problematic |
| Worst Performer | Grok (58% flagged) |
| Best Performer | Gemini (fewest highly problematic responses) |
| Reference Accuracy | Median completeness score: 40% |
| Refusals to Answer | 2 out of 250 (both from Meta AI) |
The pattern is hard to ignore. Vaccines and cancer, the two fields where decades of clean, peer-reviewed research sit in well-organized databases, were where the chatbots performed best. Even there, about a quarter of the answers were problematic. The models faltered most on nutrition and athletic performance, areas already crowded with contradictory advice and internet rumor. Overall, Grok fared worst, with 58% of its responses flagged; ChatGPT came in at 52% and Meta AI at 50%. The differences between them are almost cosmetic.
The detail that should give any reader pause is what happened with citations. When asked to supply ten scientific references, the chatbots managed a median completeness score of just 40%. Across twenty-five attempts, not one produced a reference list that was entirely accurate. Some links were broken. Some authors were invented. Some papers, neatly formatted and footnoted, simply did not exist. A patient in a clinic waiting room has no real way to tell a fabricated citation from a genuine one; they look identical.
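Invented references are, at least in part, machine-checkable. As a rough illustration, not something from the study's own method, the sketch below looks a citation's DOI up against the public Crossref API: a missing record is a strong hint that the reference was fabricated, though a real record still does not prove the paper says what the chatbot claims. The DOI in the example is a placeholder.

```python
import requests  # third-party HTTP library: pip install requests

def check_doi(doi: str) -> None:
    """Look a DOI up in the public Crossref registry.

    A 404 means Crossref has no record of the DOI, which strongly
    suggests a fabricated citation. A hit only confirms the paper
    exists, not that it supports the claim it was attached to.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code == 404:
        print(f"{doi}: no Crossref record found")
        return
    resp.raise_for_status()
    msg = resp.json()["message"]
    title = msg.get("title", ["<no title>"])[0]
    year = msg.get("issued", {}).get("date-parts", [[None]])[0][0]
    print(f"{doi}: {title} ({year})")

# Placeholder DOI for illustration only.
check_doi("10.1000/example-doi")
```

Even a check this crude would have caught the nonexistent papers the study describes. What it cannot catch is the subtler failure: a genuine citation pasted under a claim the paper never makes.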

There is a straightforward reason chatbots behave this way, and it should be stated plainly: these models do not know anything. They predict the next likely word from patterns in their training data, which mixes peer-reviewed journals with Reddit threads, supplement blogs, and the chaotic sea of social media. They do not weigh evidence. They do not pause. They write fluently because that is what they were built to do, and in medicine, fluency is a kind of disguise.
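To see why fluency and accuracy come apart, here is a toy sketch of next-token sampling, the loop at the core of these models. The vocabulary and probabilities are invented for illustration; a real model runs the same step over tens of thousands of tokens and billions of learned weights, scoring likelihood rather than truth.

```python
import random

# Invented toy distribution: given a context, each candidate next word
# carries a probability learned from training text, not from evidence.
next_word_probs = {
    "vitamin C cures": {"colds": 0.55, "nothing": 0.25, "cancer": 0.20},
}

def predict_next(context: str) -> str:
    """Sample the next word in proportion to its learned probability.

    Note what is absent: no lookup of medical evidence, no concept of
    true or false. Statistical plausibility is the only thing scored.
    """
    probs = next_word_probs[context]
    words = list(probs)
    weights = [probs[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

print("vitamin C cures", predict_next("vitamin C cures"))
```

Roughly one run in five, this toy completes the sentence with "cancer", and nothing in the mechanism flags that continuation as different in kind from "colds".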
The researchers deliberately phrased prompts to push the models toward error, a technique known in AI safety circles as "red teaming." That likely inflates the failure rate relative to careful, neutral questions. But who carefully crafts their questions while sitting at home with an odd new lump or a worrying test result? The messy, open-ended question is exactly what most people type, because that is how people think, and it is exactly where the chatbots stumbled: 32% of responses to open-ended questions were judged highly problematic, against just 7% for closed questions of the tidy true-or-false kind.
A paper published in Nature Medicine in February 2026 found something even stranger. Tested on their own, the chatbots answered about 95% of medical questions correctly. But when actual people used those same chatbots, accuracy fell below 35%, no better than if they had not used a chatbot at all. In other words, the problem is not just the technology. It is us.
These tools are not going away, and they probably should not. They are genuinely useful for summarizing research, drafting questions before an appointment, and sketching out a topic before digging deeper. But the study argues, quietly and firmly, that they should never be treated as independent medical authorities. Check the claims. Examine the references. And when a chatbot sounds most certain, that may be exactly when to trust it least.