Diagnostic Accuracy of Large Language Models for Fictional Pediatric Neuroradiology Cases: A Comparative Study with Radiologists

DOI:

https://doi.org/10.7546/CRABS.2026.05.12

Keywords:

large language models, artificial intelligence, pediatric neuroradiology, diagnostic accuracy, radiology education

Abstract

The aim of this study was to evaluate the diagnostic accuracy and differential-diagnosis quality of large language models (LLMs) on clinically realistic, text-only pediatric neuroradiology cases. This cross-sectional diagnostic accuracy study included 100 fictional pediatric neuroradiology cases, each comprising a brief clinical presentation and a structured text-only CT/MRI report, curated by an expert pediatric neuroradiologist and a pediatrician. For each case, a reference primary diagnosis and two acceptable alternatives were pre-specified. Seven LLM variants (ChatGPT-5.2 Instant/Auto/Thinking; Gemini 3 Pro/Thinking; Claude 4.5 Opus/Opus Thinking) and three radiologists (two general radiologists; one pediatric radiologist) each provided one primary diagnosis and two differentials. The primary outcome was top-1 accuracy (exact match or acceptable alternative). Secondary outcomes were the Differential Diagnosis Score (DDxScore, 1–5) and response time. Paired accuracy differences were assessed with Cochran's Q and post-hoc McNemar tests; DDxScore and response time were compared using Friedman tests with post-hoc Wilcoxon signed-rank tests and multiplicity correction. Top-1 accuracy ranged from 44% to 54% among radiologists (pediatric radiologist 54%) and from 48% to 80% among LLMs (ChatGPT-5.2 Thinking/Auto and Gemini 3 Thinking each 80%; Claude 4.5 Opus Thinking 76%). Overall accuracy differed across the ten raters (Cochran's Q = 107.86, df = 9, p < 0.001). Median DDxScores were 4.0 (IQR 2.0–4.0) for the pediatric radiologist, 3.0 (2.0–4.0) for general radiologists, and up to 5.0 (4.0–5.0) for leading thinking-mode LLMs (p < 0.001). All LLMs were faster than radiologists (p < 0.001). In text-only pediatric neuroradiology cases, several contemporary LLMs matched or exceeded radiologists in top-1 accuracy and produced high-quality differentials with substantially shorter response times. These findings support further evaluation of LLMs for education and audited decision-support workflows.
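
For readers unfamiliar with the paired tests named in the abstract, the following is a minimal Python sketch of how Cochran's Q and a post-hoc McNemar test can be computed from a cases-by-raters matrix of correct (1) and incorrect (0) top-1 answers. It is not the authors' analysis code. The function names `cochran_q` and `mcnemar_exact` and the toy matrix are hypothetical, and the exact (binomial) form of McNemar's test is an assumption, since the abstract does not say which variant was used. With ten raters, as in the study, the Q statistic would be referred to a chi-square distribution with 9 degrees of freedom.

```python
from math import comb


def cochran_q(matrix):
    """Cochran's Q for a cases x raters matrix of 0/1 outcomes.

    Returns (Q, df), where df = number of raters - 1.
    """
    k = len(matrix[0])                      # number of raters (columns)
    col_totals = [sum(row[j] for row in matrix) for j in range(k)]
    row_totals = [sum(row) for row in matrix]
    total = sum(row_totals)                 # all correct answers overall
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - total ** 2)
    denominator = k * total - sum(r * r for r in row_totals)
    if denominator == 0:
        raise ValueError("Q is undefined when every row is all 0s or all 1s")
    return numerator / denominator, k - 1


def mcnemar_exact(rater_a, rater_b):
    """Two-sided exact McNemar p-value from the discordant pairs only."""
    only_a = sum(1 for a, b in zip(rater_a, rater_b) if a == 1 and b == 0)
    only_b = sum(1 for a, b in zip(rater_a, rater_b) if a == 0 and b == 1)
    n = only_a + only_b
    if n == 0:
        return 1.0                          # no discordant pairs, no evidence
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data: 6 cases (rows) scored for 3 raters (columns).
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]

q, df = cochran_q(toy)
first = [row[0] for row in toy]
third = [row[2] for row in toy]
p_pair = mcnemar_exact(first, third)
print(f"Q = {q:.2f}, df = {df}, McNemar exact p (rater 1 vs 3) = {p_pair:.3f}")
```

In a real analysis the pairwise p-values would still need a multiplicity correction, as the abstract notes for the post-hoc comparisons.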

Author Biographies

Turay Cesur, Mamak State Hospital, Turkey

Mailing Address:
Department of Radiology,
Mamak State Hospital,
Üreğil 06270, Ankara, Türkiye

E-mail: turaycesur93@gmail.com

Yasin Celal Gunes, Kirikkale Yuksek Ihtisas Hospital, Turkey

Mailing Address:
Department of Radiology,
Kirikkale Yuksek Ihtisas Hospital,
Baglarbasi 71300, Kirikkale, Türkiye

E-mail: gunesyasincelal@gmail.com

Eren Camur, 29 Mayis State Hospital, Turkey

Mailing Address:
Department of Radiology,
29 Mayis State Hospital,
Dikmen 06460, Ankara, Türkiye

E-mail: eren.camur@outlook.com

Gulay Cesur Cinar, Thracian University, Turkey

Mailing Address:
Department of Pediatrics,
Thracian University,
Merkez 22030, Edirne, Türkiye

E-mail: drgulaycesur@gmail.com

Goksel Tuzcu, Aydin Adnan Menderes University, Turkey

Mailing Address:
Department of Pediatric Radiology,
Aydin Adnan Menderes University,
Efeler 09100, Aydin, Türkiye

E-mail: gtuzcu@adu.edu.tr

Avni Merter Keceli, Aydin Adnan Menderes University, Turkey

Mailing Address:
Department of Pediatric Radiology,
Aydin Adnan Menderes University,
Efeler 09100, Aydin, Türkiye

E-mail: avni.merter.keceli@adu.edu.tr

Published

29-05-2026

How to Cite

[1]
T. Cesur, Y. Gunes, E. Camur, G. Cinar, G. Tuzcu, and A. Keceli, “Diagnostic Accuracy of Large Language Models for Fictional Pediatric Neuroradiology Cases: A Comparative Study with Radiologists”, C. R. Acad. Bulg. Sci., vol. 79, no. 5, pp. 640–649, May 2026.

Section

Medicine