Study: AI responses to healthcare queries are nearly 76% accurate
A new study led by Penn State researchers found that AI-powered chatbots answer everyday health questions with nearly 76% accuracy, raising concerns about their trustworthiness in real-world applications. Using prompts collected in a Diagnose-a-thon competition and rated by board-certified physicians, the study found that AI performed best in obstetrics and gynecology and in otolaryngology, but poorly in internal medicine, neurology and dermatology. The researchers suggest AI tools may be more useful for physicians than for patients.
Article intelligence
Key points
- LLM responses to health queries were 76.2% accurate overall, but error rates exceeded 20%, roughly double that of human physicians.
- AI performed best in obstetrics/gynecology and otolaryngology, and worst in internal medicine, neurology, and dermatology.
- The study used a participatory Diagnose-a-thon with 34 participants submitting 212 prompts to four LLMs, evaluated by nine board-certified physicians.
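- Additional training on medical textbooks, clinical guidelines and peer-reviewed research articles did not improve responses: a panel of medical professionals and trainees preferred the Gemini and Llama base models over the augmented versions and showed no significant preference for the ChatGPT models.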
Large language models like ChatGPT respond to health queries with nearly 76% accuracy, raising concerns about their trustworthiness in real-world applications, according to Penn State researchers. Credit: fizkes/Getty Images. All Rights Reserved.
UNIVERSITY PARK, Pa. — Artificial intelligence (AI)-powered chatbots respond to everyday health-related questions from general users with nearly 76% accuracy, which raises concerns about their trustworthiness in real-world client-facing applications, according to a new study led by Penn State researchers.
The researchers wanted to understand how the average person uses AI for health-related concerns and how accurately AI responds to everyday medical queries. They found that when it comes to healthcare, especially specialized areas like neurology and dermatology, AI tools may work best in the hands of trained physicians rather than patients. The team will present their findings at the 2026 Association for Computing Machinery Fairness, Accountability and Transparency (FAccT) conference in Montreal, Canada, June 25-28.
“Our work focuses explicitly on healthcare scenarios that the average internet user might ask AI, which is a perspective that prior research into large language models (LLMs) and healthcare hasn’t covered,” said study co-author Amulya Yadav, associate professor of informatics and intelligent systems in Penn State’s College of Information Sciences and Technology (IST). “We wanted to understand that if people are using LLMs like ChatGPT as a symptom health checker, like historically we’ve used Google, how accurate is the LLM in answering those queries, and how harmful could those responses be?”
To understand how accurate or harmful health-related LLM responses could be for the average internet user, the researchers held an AI competition called a Diagnose-a-thon at Penn State. A total of 34 participants — comprising faculty, staff and undergraduate and graduate students — submitted 212 prompts and AI-generated responses to real and imaginary health concerns written from both patient and doctor perspectives. Participants were allowed to choose one of four LLMs to use for the contest: ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro and Llama3-8b.
“One of the strengths of our study is we’re essentially trying to replicate real-world usage of LLMs by telling participants to choose the LLM of their choice and use it as they would on a normal day,” said Bonam Mingole, lead author of the study and doctoral candidate in information sciences and technology. “This type of participatory research is so important for understanding how the public uses AI in their daily life.”
The researchers then asked nine board-certified physicians to rate both the accuracy and the potential harm of each AI-generated response on a six-point scale ranging from very low to very high. A competition committee awarded prizes to the eight submissions that generated the most medically accurate information and a prize to the submission that generated the response most likely to cause harm.
They found that overall, 76.2% of LLM-generated responses provided accurate information. Specialties such as obstetrics and gynecology and otolaryngology — the treatment of disorders that affect the ear, nose and throat — saw the best LLM performance, with high validity scores and low harm scores. Internal medicine, neurology and dermatology saw the worst AI performance, with low validity scores and higher harm scores, according to the researchers. They added that very specific prompts, and prompts between 60 and 250 characters, resulted in more accurate LLM outputs.
The researchers then took the base model of each LLM and trained it on medical textbooks, clinical guidelines and peer-reviewed research articles included in a medical school curriculum to see if additional training would increase response validity scores and decrease harm scores. They asked a panel of seven medical professionals and trainees (a board-certified physician, two second-year internal medicine residents, two fourth-year medical students and two third-year medical students) to compare responses from the base LLMs with responses from the augmented LLMs and determine which were more clinically appropriate. The panel preferred the responses from the Gemini and Llama base models over those from the augmented models and showed no significant preference between the base and augmented ChatGPT models.
“We’re entering a new age of healthcare, and AI is a significant part of it,” said study co-author Jennifer Kraschnewski, director of the Penn State Clinical and Translational Science Institute and professor in internal medicine at the Penn State College of Medicine. “There’s a real opportunity for healthcare to transform, to integrate these new tools so that clinicians like myself can use them to improve patient care.”
The researchers also noted that despite the LLM validity scores, AI error rates still exceeded 20%, roughly double the error rate of human physicians. Those errors, they said, could potentially be harmful to patients.
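As a rough check on how the accuracy and error figures relate, the short sketch below assumes the error rate is simply the complement of the reported 76.2% accuracy. The article does not state how the error rate was calculated, so this is an illustration rather than the researchers' method, and the constant and function names are hypothetical.

```python
# Illustrative arithmetic only. Assumption: error rate = 100% - accuracy.
# The names below are hypothetical and do not come from the study.

REPORTED_ACCURACY_PCT = 76.2  # overall accuracy figure reported in the article


def implied_error_pct(accuracy_pct: float) -> float:
    """Return the complement of an accuracy percentage, rounded to one decimal."""
    return round(100.0 - accuracy_pct, 1)


error_pct = implied_error_pct(REPORTED_ACCURACY_PCT)

# Compare against the 20% threshold mentioned in the article.
exceeds_20 = error_pct > 20.0

print(f"Implied error rate: {error_pct}%")
print(f"Exceeds 20%: {exceeds_20}")
```

Under that assumption, the implied error rate is consistent with the article's statement that errors exceeded 20%.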
“I don’t think AI will replace human physicians, but I do think there’s a huge opportunity for us to help upskill today’s physician in a way that’s never been done before,” said Kraschnewski, suggesting that current LLMs may prove better tools for medical professionals than patients.
Overall, the study highlights the potential beneficial and harmful impacts that AI may have on a key aspect of everyone’s life, according to the researchers.
“Like it or not, people will continue to use AI for diagnosing their health problems,” said study co-author S. Shyam Sundar, Evan Pugh University Professor and James P. Jimirro Professor of Media Effects at Penn State. “By understanding their use patterns and testing the validity of AI performance, our project helps advance literacy on the best and worst uses of AI for medical advice.”
Aditya Majumdar and Firdaus Ahmed Choudhury, doctoral students in Penn State’s College of IST, also contributed to the study. The Center for Socially Responsible Artificial Intelligence at Penn State hosted the Diagnose-a-thon competition.
Last Updated May 28, 2026
Contact
Francisco Tutella