Diagnostic Accuracy of Large Language Models for Fictional Pediatric Neuroradiology Cases: A Comparative Study with Radiologists
DOI:
https://doi.org/10.7546/CRABS.2026.05.12
Keywords:
large language models, artificial intelligence, pediatric neuroradiology, diagnostic accuracy, radiology education
Abstract
This study evaluated the diagnostic accuracy and differential-diagnosis quality of large language models (LLMs) on clinically realistic, text-only pediatric neuroradiology cases. This cross-sectional diagnostic accuracy study included 100 fictional pediatric neuroradiology cases, each composed of a brief clinical presentation and a structured text-only CT/MRI report, curated by an expert pediatric neuroradiologist and a pediatrician. For each case, a reference primary diagnosis and two acceptable alternatives were pre-specified. Seven LLM variants (ChatGPT-5.2 Instant/Auto/Thinking; Gemini 3 Pro/Thinking; Claude 4.5 Opus/Opus Thinking) and three radiologists (two general radiologists and one pediatric radiologist) each provided one primary diagnosis and two differentials. The primary outcome was top-1 accuracy (exact match or acceptable alternative). Secondary outcomes were the Differential Diagnosis Score (DDxScore, 1–5) and response time. Paired accuracy differences were assessed with Cochran's Q and post-hoc McNemar tests; DDxScore and response time were compared using Friedman tests with post-hoc Wilcoxon signed-rank tests and multiplicity correction. Top-1 accuracy ranged from 44% to 54% among radiologists (pediatric radiologist 54%) and from 48% to 80% among LLMs (ChatGPT-5.2 Thinking/Auto and Gemini 3 Thinking 80%; Claude 4.5 Opus Thinking 76%). Overall accuracy differed across raters (Cochran's Q = 107.86, df = 9, p < 0.001). Median DDxScores were 4.0 (IQR 2.0–4.0) for the pediatric radiologist, 3.0 (2.0–4.0) for general radiologists, and up to 5.0 (4.0–5.0) for leading thinking-mode LLMs (p < 0.001). All LLMs were faster than radiologists (p < 0.001). In text-only pediatric neuroradiology cases, several contemporary LLMs matched or exceeded radiologists in top-1 accuracy and produced high-quality differentials with substantially shorter response times. These findings support further evaluation of LLMs for education and audited decision-support workflows.
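For readers unfamiliar with the omnibus test named in the abstract, the following is a minimal sketch of how Cochran's Q can be computed from a cases-by-raters table of correct/incorrect (1/0) top-1 outcomes. It is not the authors' analysis code: the function name `cochrans_q` and the toy table are hypothetical illustrations, and the toy values are not study data. With k raters the statistic has k - 1 degrees of freedom, which is consistent with the reported df = 9 for ten raters (seven LLM variants and three radiologists).

```python
def cochrans_q(outcomes):
    """Cochran's Q for paired binary outcomes.

    outcomes: list of rows (one per case); each row holds one 0/1 value
    per rater (1 = correct top-1 diagnosis). Returns (Q, degrees of freedom).
    """
    k = len(outcomes[0])  # number of raters (columns)
    if any(len(row) != k for row in outcomes):
        raise ValueError("every case needs one outcome per rater")

    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    row_totals = [sum(row) for row in outcomes]
    grand_total = sum(row_totals)

    denominator = k * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        # Happens when every case is all-correct or all-incorrect.
        raise ValueError("Q is undefined: no case discriminates between raters")

    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    return numerator / denominator, k - 1


# Hypothetical toy table: 6 cases (rows) x 3 raters (columns); not study data.
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
]

q_stat, dof = cochrans_q(toy)
print(f"Q = {q_stat:.2f}, df = {dof}")
```

Under the null hypothesis of equal accuracy across raters, Q is approximately chi-square distributed with k - 1 degrees of freedom, so a p-value can be obtained from the chi-square survival function. Pairwise follow-up comparisons would then use McNemar tests, as the abstract describes.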
License
Copyright (c) 2026 Proceedings of the Bulgarian Academy of Sciences
Copyright is subject to the protection of the Bulgarian Copyright and Associated Rights Act. The copyright holder of all articles on this site is Proceedings of the Bulgarian Academy of Sciences. If you want to reuse any part of the content, please contact us.

