Applications of large language models in radiology: a systematic review.
- Authors: Vasilev Y.A. [1], Reshetnikov R.V. [2], Nanova O.G. [3], Vladzymyrskyy A.V. [1], Arzamasov K.M. [1], Omelyanskaya O.V. [1], Kodenko M.R. [4], Erizhokov R.A. [2], Pamova A.P. [5]
- Affiliations:
  - [1] Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies
  - [2] Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department
  - [3] Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies, Department of Health Care of Moscow, Petrovka Street, 24, Building 1, 127051 Moscow, Russian Federation
  - [4] Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies
  - [5] State Budget-Funded Health Care Institution of the City of Moscow "Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department"
- Section: Systematic reviews
- Submitted: 06.05.2025
- Accepted: 12.06.2025
- Published: 17.06.2025
- URL: https://jdigitaldiagnostics.com/DD/article/view/678373
- DOI: https://doi.org/10.17816/DD678373
- ID: 678373
Abstract
Introduction: Modern large language models could be used in radiological diagnostics for a number of routine tasks: generating structured reports, extracting information from radiological reports, and making diagnoses. Realizing this potential requires assessing the diagnostic effectiveness and reproducibility of the results these models produce.
Objective: To analyze the worldwide literature on the application of large language models in radiological diagnostics, evaluate the diagnostic effectiveness and accuracy of these models in addressing existing tasks, and identify potential problems that may hinder the implementation of large language models in radiological practice.
Materials and methods: A search for relevant studies published in 2023-2025 was conducted in the PubMed and RSCI databases and in reference lists. The quality of the selected studies was assessed using the QUADAS-CAD questionnaire.
Results: Nine studies were included. The most common tasks were diagnosis based on radiological reports (3 studies) and detection of clinically significant findings in reports (2 studies). GPT-4 (5 studies) and BERT (3 studies) were the most frequently used large language models; GPT-3.5, Llama 2, Med42, GPT-4V, and Gemini Pro were also used. GPT-4 demonstrated high diagnostic effectiveness and accuracy in diagnosing brain tumors (accuracy 73.0%) and myocarditis (83.0%), and in decision-making regarding invasive procedures for acute coronary syndrome (86.0%). Its diagnostic effectiveness and accuracy were lower in diagnosing nervous system pathologies of various origins (50.0%) and musculoskeletal disorders (43.0%). The BERT model showed high diagnostic effectiveness and accuracy in detecting pulmonary nodules (99.0%) and signs of intracranial hemorrhage (sensitivity 97.0%, specificity 90.0%), and in classifying reports (accuracy 84.3%).
Most of the studies (88.9%) were at risk of systematic error (bias). The main reasons were small and imbalanced samples, overlap between training and test datasets, and insufficiently careful preparation and description of reference standards.
Discussion: The diagnostic effectiveness of large language models varies across studies. Implementing large language models in clinical practice requires standardizing and improving the methodological quality of AI research.
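The sensitivity, specificity, and accuracy values quoted in the abstract follow the standard confusion-matrix definitions. As a minimal sketch of how such figures are derived, the Python snippet below computes them from counts of true and false positives and negatives. The `ConfusionCounts` class, the function names, and the example counts are hypothetical and chosen only for illustration; they are not data from any of the included studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing model output against a reference standard."""
    tp: int  # true positives: finding present, model says present
    fp: int  # false positives: finding absent, model says present
    tn: int  # true negatives: finding absent, model says absent
    fn: int  # false negatives: finding present, model says absent


def sensitivity(c: ConfusionCounts) -> float:
    """Share of reference-positive cases the model detects."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Share of reference-negative cases the model correctly rejects."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all cases the model classifies correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


# Hypothetical counts, NOT taken from the reviewed studies.
example = ConfusionCounts(tp=40, fp=5, tn=45, fn=10)

print(f"sensitivity: {sensitivity(example):.1%}")
print(f"specificity: {specificity(example):.1%}")
print(f"accuracy:    {accuracy(example):.1%}")
```

Accuracy depends on the ratio of positive to negative cases, so on an imbalanced sample a high accuracy can coexist with poor sensitivity; this is one reason the imbalanced samples noted in the results matter when comparing studies.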
Full Text
Funding source. The publication of this work was supported by the Moscow Government Grant "Research on the application of large language models in the field of healthcare based on artificial intelligence technologies" in accordance with Moscow Government Resolution No. 656-PP of April 1, 2025.
Competing interests. The authors declare that they have no competing interests.
Author contribution. All authors confirm that their authorship meets the international ICMJE criteria (all authors made a significant contribution to the development of the concept, the research, and the preparation of the article, and read and approved the final version before publication). The largest contributions are distributed as follows: Yu.A. Vasilev, A.V. Vladzymyrskyy, O.V. Omelyanskaya: development of the research concept, approval of the final version of the manuscript. R.V. Reshetnikov, O.G. Nanova, K.M. Arzamasov, M.R. Kodenko, R.A. Erizhokov: literature review, data analysis, writing the text of the manuscript.
TABLES
Table 1. List of the included studies and their basic characteristics
Table 2. Diagnostic parameters of large language models and medical workers: sensitivity, specificity, and accuracy
FIGURES
Fig. 1. Systematic literature search flowchart
Fig. 2. Risk of bias estimation by QUADAS-CAD
SUPPLEMENTARY
Table S1. List of the included studies and their basic characteristics (continuation of table 1)
Table S2. QUADAS-CAD domain questions: the italic font denotes key questions.
About the authors
Yuriy A. Vasilev
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies
Email: npcmr@zdrav.mos.ru
ORCID iD: 0000-0002-5283-5961
SPIN-code: 4458-5608
MD, Dr. Sci. (Medicine)
Russian Federation, Moscow
Roman V. Reshetnikov
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department
Email: r.reshetnikov@npcmr.ru
ORCID iD: 0000-0002-9661-0254
Cand. Sci. (Physical and Mathematical), Department Head of Medical Research, Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department, Moscow, 127051, Russian Federation
Russian Federation
Olga G. Nanova
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies, Department of Health Care of Moscow, Petrovka Street, 24, Building 1, 127051 Moscow, Russian Federation
Author for correspondence.
Email: nanova@mail.ru
ORCID iD: 0000-0001-8886-3684
SPIN-code: 6135-4872
Leading Researcher
Russian Federation, 24/1 Petrovka street, 127051 Moscow, Russia
Anton V. Vladzymyrskyy
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies
Email: vladzimirskijAV@zdrav.mos.ru
ORCID iD: 0000-0002-2990-7736
SPIN-code: 3602-7120
MD, Dr. Sci. (Medicine)
Russian Federation, Moscow
Kirill M. Arzamasov
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies
Email: ArzamasovKM@zdrav.mos.ru
ORCID iD: 0000-0001-7786-0349
SPIN-code: 3160-8062
MD, Cand. Sci. (Medicine)
Russian Federation, Moscow
Olga V. Omelyanskaya
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies
Email: OmelyanskayaOV@zdrav.mos.ru
ORCID iD: 0000-0002-0245-4431
SPIN-code: 8948-6152
Russian Federation, Moscow
Maria R. Kodenko
Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies
Email: m.r.kodenko@yandex.ru
ORCID iD: 0000-0002-0166-3768
SPIN-code: 5789-0319
Russian Federation
Rustam A. Erizhokov
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department
Email: ErizhokovRA@zdrav.mos.ru
ORCID iD: 0009-0007-3636-2889
SPIN-code: 2274-6428
Junior Research Fellow, Head of Department
24/1 Petrovka street, 127051 Moscow, Russia
Anastasia P. Pamova
State Budget-Funded Health Care Institution of the City of Moscow "Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department"
Email: PamovaAP@zdrav.mos.ru
ORCID iD: 0000-0002-0041-3281
SPIN-code: 5146-4355
Russian Federation, 24/1 Petrovka street, 127051 Moscow, Russia
