
ChatGPT Matches Radiologists in Final Diagnoses and Surpasses Them in Differential Diagnoses of Brain Tumors


A recent study compared the diagnostic accuracy of ChatGPT, based on the GPT-4 model, with that of radiologists on 150 MRI reports of brain tumors. ChatGPT achieved a final diagnostic accuracy of 73%, slightly higher than that of neuroradiologists (72%) and general radiologists (68%); for differential diagnoses, its accuracy reached 94%, above the radiologists' range of 73% to 89%. These findings reinforce the emerging role of AI in radiology, with the potential to reduce physician workload and improve diagnostic accuracy.


The emergence and subsequent advancements of large language models (LLMs) such as the GPT series have recently dominated the global discourse on technology.


These models represent a new frontier in artificial intelligence, using machine learning techniques to process and generate language in a way that rivals human-level complexity and nuance.


The rapid evolution and widespread impact of LLMs have become a global phenomenon, sparking discussions about their potential applications and implications.


Additionally, chatbots such as ChatGPT (Chat Generative Pre-trained Transformer), which are built on these large language models, have made it easy to interact with them in a conversational format.

Among LLMs, the GPT series in particular has gained significant attention, and many applications have been explored in the field of radiology.


Among these, the potential of GPT to aid in diagnosis from imaging findings is notable because such capabilities can complement essential aspects of daily clinical practice and education.


Two studies demonstrate the potential of GPT-4 to generate differential diagnoses in the field of neuroradiology. One uses the “Case of the Week” from the American Journal of Neuroradiology, and the other uses cases from the “Freiburg Neuropathology Case Conference” in the journal Clinical Neuroradiology.


In addition, large language models such as GPT-4 have shown differential diagnostic potential in subspecialties beyond neuroradiology. Although these pioneering investigations suggest that GPT-4 may play an important role in radiological diagnosis, no prior study had evaluated it using real-world radiology reports.


Unlike quiz cases, which tend to feature carefully selected typical presentations and are written by people who already know the correct diagnosis, real-world radiology reports contain less structured and more diverse information. This difference can lead to biased assessments that do not reflect the complex nature of clinical radiology.


To address this gap, a new study examines the diagnostic capabilities of GPT-4 using only real-world clinical radiology reports. In daily clinical practice, working through differential and final diagnoses can be challenging and time-consuming; if GPT-4 excels at this process, it could hold real value in clinical settings.


The study, conducted by researchers at Osaka Metropolitan University and published in European Radiology, explored the use of GPT-4, a large language model (LLM), for diagnosis in radiology, specifically in cases of brain tumors.


The aim was to see whether AI can support radiologists in interpreting complex MRI reports, which matters for patients who need a fast and accurate diagnosis.


The research focused on real-world MRI reports of brain tumor patients from two Japanese institutions between 2017 and 2021. These reports, originally in Japanese, were translated into English by experienced radiologists. GPT-4 and five radiologists were then given the textual information from the same reports and instructed to suggest differential diagnoses (possible alternative conditions) and final diagnoses for each case.
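The paper's exact prompt is not reproduced in this article, but the basic workflow is easy to picture. The snippet below is a minimal sketch, assuming the OpenAI Python client; the prompt wording, temperature setting, and example finding are illustrative assumptions, not the study's protocol.

```python
# Hypothetical sketch, not the study's pipeline: submitting the textual
# findings of one MRI report to GPT-4 and asking for differential and
# final diagnoses. Prompt wording and parameters are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def diagnose_from_report(findings: str) -> str:
    """Ask GPT-4 for differential diagnoses and a single final diagnosis."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output, easier to score against ground truth
        messages=[
            {"role": "system",
             "content": "You are a neuroradiologist interpreting brain MRI reports."},
            {"role": "user",
             "content": ("Based on the following MRI findings, list three "
                         "differential diagnoses, then state the single most "
                         "likely final diagnosis.\n\nFindings: " + findings)},
        ],
    )
    return response.choices[0].message.content

# Invented example finding, for illustration only.
print(diagnose_from_report(
    "Extra-axial, homogeneously enhancing mass at the left cerebellopontine "
    "angle with a dural tail."
))
```

Each model answer, like each radiologist's answer, can then be scored as correct or incorrect against the pathological diagnosis.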


Diagnostic accuracy was verified using the final pathological diagnosis, made after analysis of surgically removed tumor tissue, which served as the “gold standard” to validate diagnoses.

Across the 150 cases, GPT-4 achieved 73% accuracy in final diagnoses, within the radiologists' range of 65% to 79%. For differential diagnoses, GPT-4 achieved 94% accuracy, higher than the radiologists' range of 73% to 89%.
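The abstract notes that McNemar's test was used for the paired comparison. Below is a minimal sketch of that analysis, assuming statsmodels and a made-up 2x2 table of per-case outcomes; the counts are illustrative, not the study's data.

```python
# McNemar's test on paired per-case correctness: did GPT-4 and a radiologist
# get the same 150 cases right? The counts below are invented for illustration.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: GPT-4 correct / incorrect; columns: radiologist correct / incorrect.
table = np.array([
    [100, 10],   # both correct             / only GPT-4 correct
    [  8, 32],   # only radiologist correct / both incorrect
])

print(f"GPT-4 accuracy:       {table[0].sum() / table.sum():.0%}")     # 73%
print(f"Radiologist accuracy: {table[:, 0].sum() / table.sum():.0%}")  # 72%

# The exact test considers only the discordant pairs (here 10 vs. 8).
result = mcnemar(table, exact=True)
print(f"McNemar p-value:      {result.pvalue:.3f}")
```

Because the test depends only on the cases where the two readers disagree, it is well suited to comparing two diagnostic readers on the same set of reports.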


Interestingly, GPT-4’s accuracy in differential diagnoses was high and consistent, regardless of whether the reports were written by neuroradiologists or general radiologists.


These results indicate that GPT-4 has the potential to become a valuable adjunct tool for brain tumor diagnostics. It can serve as a “second opinion” for neuroradiologists on final diagnoses, helping to increase confidence in decision-making, and as a guiding tool for general radiologists and residents, who can benefit from the model's diagnostic suggestions.


The consistency of GPT-4 in differential diagnoses reinforces the possibility of integrating AI into clinical practices to optimize the diagnostic process in high-demand situations, allowing for greater precision and agility in patient care.



READ MORE:


Comparative analysis of GPT-4-based ChatGPT’s diagnostic performance with radiologists using real-world radiology reports of brain tumors.

Yasuhito Mitsuyama, Hiroyuki Tatekawa, Hirotaka Takita, Fumi Sasaki, Akane Tashiro, Satoshi Oue, Shannon L. Walston, Yuta Nonomiya, Ayumi Shintani, Yukio Miki & Daiju Ueda

European Radiology (Imaging Informatics and Artificial Intelligence), August 2024.


Abstract:


Large language models like GPT-4 have demonstrated potential for diagnosis in radiology. Previous studies investigating this potential primarily utilized quizzes from academic journals. This study aimed to assess the diagnostic capabilities of GPT-4-based Chat Generative Pre-trained Transformer (ChatGPT) using actual clinical radiology reports of brain tumors and compare its performance with that of neuroradiologists and general radiologists. We collected brain MRI reports written in Japanese from preoperative brain tumor patients at two institutions from January 2017 to December 2021. The MRI reports were translated into English by radiologists. GPT-4 and five radiologists were presented with the same textual findings from the reports and asked to suggest differential and final diagnoses. The pathological diagnosis of the excised tumor served as the ground truth. McNemar’s test and Fisher’s exact test were used for statistical analysis. In a study analyzing 150 radiological reports, GPT-4 achieved a final diagnostic accuracy of 73%, while radiologists’ accuracy ranged from 65 to 79%. GPT-4’s final diagnostic accuracy using reports from neuroradiologists was higher at 80%, compared to 60% using those from general radiologists. In the realm of differential diagnoses, GPT-4’s accuracy was 94%, while radiologists’ fell between 73 and 89%. Notably, for these differential diagnoses, GPT-4’s accuracy remained consistent whether reports were from neuroradiologists or general radiologists. GPT-4 exhibited good diagnostic capability, comparable to neuroradiologists in differentiating brain tumors from MRI reports. GPT-4 can be a second opinion for neuroradiologists on final diagnoses and a guidance tool for general radiologists and residents.


