News Release
Tuesday, July 23, 2024
NIH findings shed light on risks and benefits of integrating AI into medical decision-making
AI model scored well on medical diagnostic quiz, but made mistakes explaining answers.
Researchers at the National Institutes of Health (NIH) found that an artificial intelligence (AI) model solved medical quiz questions, designed to test health professionals' ability to diagnose patients based on clinical images and a brief text summary, with high accuracy. However, physician graders found that the AI model made mistakes when describing images and explaining how its decision-making led to the correct answer. The findings, which shed light on AI's potential in the clinical setting, were published in npj Digital Medicine. The study was led by researchers from NIH's National Library of Medicine (NLM) and Weill Cornell Medicine, New York City.
"Integration of AI into health care holds great promise as a tool to help medical professionals diagnose patients faster, allowing them to start treatment sooner," said NLM Acting Director Stephen Sherry, Ph.D. "However, as this study shows, AI is not advanced enough yet to replace human experience, which is crucial for accurate diagnosis."
The AI model and human physicians answered questions from the New England Journal of Medicine (NEJM)'s Image Challenge. The challenge is an online quiz that provides real clinical images and a short text description that includes details about the patient's symptoms and presentation, then asks users to choose the correct diagnosis from multiple-choice answers.
The researchers asked the AI model to answer 207 image challenge questions and to provide a written rationale justifying each answer. The prompt specified that the rationale should include a description of the image, a summary of relevant medical knowledge, and step-by-step reasoning for how the model chose the answer.
The researchers recruited nine physicians from various institutions, each with a different medical specialty, who answered their assigned questions first in a "closed-book" setting (without referring to any external materials, such as online resources) and then in an "open-book" setting (using external resources). The researchers then provided the physicians with the correct answer, along with the AI model's answer and corresponding rationale. Finally, the physicians were asked to score the AI model's ability to describe the image, summarize relevant medical knowledge, and provide its step-by-step reasoning.
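The evaluation protocol described above (prompt the model for a justified multiple-choice answer to each question, then score the choices against the answer key) can be sketched in Python. This is a minimal illustration only; the class and function names, and the prompt wording, are invented here and are not taken from the study's code.

```python
from dataclasses import dataclass

@dataclass
class ChallengeItem:
    """One Image Challenge question: case text, answer choices, correct key."""
    question: str
    choices: list[str]
    answer: str

# The study's prompt asked for an image description, a knowledge summary,
# and step-by-step reasoning; the exact wording below is hypothetical.
PROMPT_TEMPLATE = (
    "You are given a clinical image and a brief case summary.\n"
    "{question}\n"
    "Choices: {choices}\n"
    "Justify your answer with: (1) a description of the image, "
    "(2) a summary of relevant medical knowledge, and "
    "(3) step-by-step reasoning for your chosen diagnosis."
)

def build_prompt(item: ChallengeItem) -> str:
    """Render the evaluation prompt for one question."""
    return PROMPT_TEMPLATE.format(
        question=item.question, choices=" / ".join(item.choices)
    )

def accuracy(predictions: list[str], items: list[ChallengeItem]) -> float:
    """Fraction of questions where the predicted choice matches the key."""
    correct = sum(p == it.answer for p, it in zip(predictions, items))
    return correct / len(items)
```

In the actual study, the free-text rationales were then graded separately by the physicians; only the multiple-choice accuracy lends itself to automatic scoring like this.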
The researchers found that the AI model and physicians scored highly in selecting the correct diagnosis. Interestingly, the AI model selected the correct diagnosis more often than physicians in closed-book settings, while physicians with open-book tools performed better than the AI model, especially when answering the questions ranked most difficult.
Importantly, based on physician evaluations, the AI model often made mistakes when describing the medical image and explaining its reasoning behind the diagnosis, even in cases where it made the correct final choice. In one example, the AI model was provided with a photo of a patient's arm with two lesions. A physician would easily recognize that both lesions were caused by the same condition. However, because the lesions were presented at different angles, causing the illusion of different colors and shapes, the AI model failed to recognize that both lesions could be related to the same diagnosis.
The researchers argue that these findings underscore the importance of further evaluating multimodal AI technology before introducing it into the clinical setting.
"This technology has the potential to help clinicians augment their capabilities with data-driven insights that may lead to improved clinical decision-making," said NLM Senior Investigator and corresponding author of the study, Zhiyong Lu, Ph.D. "Understanding the risks and limitations of this technology is essential to harnessing its potential in medicine."
The study used an AI model known as GPT-4V (Generative Pre-trained Transformer 4 with Vision), which is a "multimodal AI model" that can process combinations of multiple types of data, including text and images. The researchers note that while this is a small study, it sheds light on multimodal AI's potential to aid physicians' medical decision-making. More research is needed to understand how such models compare to physicians' ability to diagnose patients.
The study was co-authored by collaborators from NIH's National Eye Institute and the NIH Clinical Center; the University of Pittsburgh; UT Southwestern Medical Center, Dallas; New York University Grossman School of Medicine, New York City; Harvard Medical School and Massachusetts General Hospital, Boston; Case Western Reserve University School of Medicine, Cleveland; University of California San Diego, La Jolla; and the University of Arkansas, Little Rock.
The National Library of Medicine (NLM) is a leader in research in biomedical informatics and data science and the world's largest biomedical library. NLM conducts and supports research in methods for recording, storing, retrieving, preserving, and communicating health information. NLM creates resources and tools that are used billions of times each year by millions of people to access and analyze molecular biology, biotechnology, toxicology, environmental health, and health services information. Additional information is available at
About the National Institutes of Health (NIH): NIH, the nation's medical research agency, includes 27 Institutes and Centers and is a component of the U.S. Department of Health and Human Services. NIH is the primary federal agency conducting and supporting basic, clinical, and translational medical research, and is investigating the causes, treatments, and cures for both common and rare diseases. For more information about NIH and its programs, visit www.nih.gov.
NIH…Turning Discovery Into Health®
Reference
Qiao Jin, et al. Hidden Flaws Behind Expert-Level Accuracy of Multimodal GPT-4 Vision in Medicine. npj Digital Medicine (2024). DOI: