Diagnosis of intracranial hemorrhages based on brain computed tomography with artificial intelligence
- Authors: Khoruzhaya A.N.¹, Arzamasov K.M.¹, Kodenko M.R.¹·², Kremneva E.I.¹·², Burenchev D.V.¹
- Affiliations:
- ¹ Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies
- ² Russian Center of Neurology and Neurosciences
- Issue: Vol 6, No 2 (2025)
- Pages: 214-228
- Section: Original Study Articles
- Submitted: 13.01.2025
- Accepted: 05.02.2025
- Published: 08.07.2025
- URL: https://jdigitaldiagnostics.com/DD/article/view/645364
- DOI: https://doi.org/10.17816/DD645364
- EDN: https://elibrary.ru/RFYVMC
- ID: 645364
Abstract
BACKGROUND: Intracranial hemorrhages are associated with high mortality and risk of disability, requiring prompt and accurate diagnosis, particularly within the first 24 hours. The use of artificial intelligence technologies in analyzing brain computed tomography scans can shorten diagnostic time and improve diagnostic quality. The relevance of this study is emphasized by the limited number of certified artificial intelligence services for detecting intracranial hemorrhages in Russia and the lack of data on their long-term effectiveness, highlighting the need for multicenter monitoring to assess the stability and accuracy of such systems in clinical practice.
AIM: The study aimed to assess the diagnostic accuracy and stability of an artificial intelligence service in detecting intracranial hemorrhages on non-contrast brain computed tomography scans in a multicenter clinical monitoring setting for 18 months.
METHODS: Anonymized brain computed tomography scans were used. The artificial intelligence service underwent a three-phase evaluation to evaluate its diagnostic accuracy and clinical performance using limited datasets. Two radiologists specializing in neuroimaging examined 80 brain computed tomography scans each month for 18 months, which had been preprocessed by the artificial intelligence service and randomly selected from the clinical workflow. The results were analyzed using ROC analysis with sensitivity, specificity, accuracy, and area under the curve.
RESULTS: During clinical monitoring, 1200 brain computed tomography scans were analyzed, with signs of intracranial hemorrhage detected in 48.3% of the scans. Based on the binary classification of intracranial hemorrhage presence or absence performed by the artificial intelligence service, the following diagnostic metrics were obtained: sensitivity, 97.4% (95.8–98.5); specificity, 75.4% (71.8–78.7); accuracy, 86.0% (83.9–87.9); and area under the curve, 94% (92.6–95.3). Over the monitoring period, a significant moderate positive correlation with time was observed for most diagnostic metrics, except sensitivity, which was affected by an update to the service version. However, full concordance between artificial intelligence-based markings and radiologist conclusions was noted in 28.5% of cases of identified intracranial hemorrhage, whereas discrepancies were found in 71.5%. The refined diagnostic metrics for cases with complete agreement with the radiologists’ report were as follows: sensitivity, 26.6%; specificity, 73.8%; accuracy, 50.1%; and area under the curve, 49.6%.
CONCLUSION: The current configuration of the artificial intelligence service allows ruling out intracranial hemorrhage with very high probability, which may be useful in the initial triaging of patients in emergency settings. However, low values of refined metrics indicate considerable discrepancies between radiologist reports and service-generated results regarding the interpretation of pathological findings.
BACKGROUND
Intracranial hemorrhage (ICH) is a potentially life-threatening acute condition associated with blood extravasation into brain tissue and may occur either spontaneously or as a result of head trauma or surgical intervention. According to global statistics, the incidence of ICH is more than 25 cases per 100,000 person-years [1]. In the Russian Federation, hemorrhagic stroke is diagnosed annually in approximately 43,000 patients and accounts for 10%–15% of all cases of acute cerebrovascular events [2]. The etiology of nontraumatic ICH is heterogeneous and includes arterial hypertension, stroke, aneurysm rupture, vasculopathy, cerebral venous sinus thrombosis, arteriovenous fistula, malignant neoplasms, anticoagulant therapy, and, more rarely, inflammatory diseases [1]. ICH is associated with a high risk of mortality (up to 40%–50%) within the first 24 hours and disability among survivors [1, 2]. By their location, ICHs are commonly classified into epidural, subdural, subarachnoid, and parenchymal (intracerebral). They are differentiated based on clinical presentation, imaging characteristics, and prognosis [2]. Brain computed tomography (CT) is the primary diagnostic modality used to evaluate patients presenting to emergency departments with headache or focal neurologic deficits [2]. It is a relatively accessible and rapid imaging method, the results of which allow assessing disease severity and determining treatment strategy [3]. Early diagnosis of ICH within the first 24 hours is clinically important for reducing early mortality. This is achieved through evaluation of the involved brain regions, hemorrhage volume, and the presence of high-risk imaging markers (such as the swirl sign and spot sign), as well as consideration of the clinical context, which together enable timely selection of management strategy and organization of follow-up care [4, 5].
Artificial intelligence (AI) technologies are actively and successfully used in emergency neuroimaging for the diagnosis of acute ischemic stroke, neuroinfections, and spinal cord compression [6]. AI is also in demand for primary detection of ICH, as this condition has well-differentiated diagnostic features [7]. The feasibility of AI technologies for ICH diagnosis is driven by the need to accelerate the primary diagnostic process (from 512 to 19 minutes) [7] and to improve its accuracy [8], in particular through patient triage and prioritization of radiologists’ worklists [9, 10]. However, the clinical application of AI is only feasible with high diagnostic accuracy. It depends on the quality of training and validation datasets [11] and requires external confirmation of the obtained results, which remains one of the key challenges [12]. One possible solution is the implementation of preliminary and interim validation of diagnostic metrics using data from the medical information system in which the AI-based service is intended to be deployed. Such validation facilitates quality control to confirm the diagnostic performance of AI services over time and enables further refinement based on regular error analysis [13, 14]. Nevertheless, only a limited number of studies in international publications address long-term monitoring of the life cycle of AI-based services for ICH diagnosis with regular assessment of their performance, whereas such data are lacking in Russian sources.
Since 2020, an experiment on the use of computer vision technologies for medical image analysis has been conducted in the Moscow city healthcare system (Experiment), involving more than 50 AI-based services across 41 diagnostic domains.1 Four AI-based software products for the automatic analysis of digital brain CT images to detect ICH are currently authorized in the Russian Federation. In April 2022, the first service of this kind was launched in the Experiment, where it remained the only option for a considerable time.
AIM
The work aimed to evaluate the diagnostic accuracy and stability of an AI service to detect ICH based on non-contrast brain CT in a multicenter clinical monitoring setting over an 18-month period.
METHODS
Study Design
This was a retrospective, multicenter clinical monitoring study lasting 18 months.
Artificial Intelligence Service
The study object was the CELS® software (Medical Screening Systems LLC, Russia; hereinafter, the AI service). At the stage of application for participation in the Experiment in April 2022, this AI service had been trained on a dataset of more than 15,000 anonymized diagnostic study results from two healthcare facilities. The cases were labeled as either normal or pathological (ICH). The training dataset included all CT examinations containing the following types of hemorrhage:
- subdural
- intracerebral
- epidural
Pathological images accounted for 60% of the total training dataset.
To establish baseline performance metrics, the AI service underwent preliminary clinical and technical testing, which was required for participation in the Experiment. An independent dataset that was not used in the training dataset was created. It comprised 260 CT examinations: 130 with identified pathological conditions and 130 classified as normal. Based on these data, a contingency table was generated with the distribution of diagnostic outcomes (false positives, false negatives, true positives, and true negatives), after which analytical validation metrics were calculated. The following mean diagnostic accuracy values were obtained: area under the curve (AUC), 0.89; sensitivity, 0.84; specificity, 0.74; and accuracy, 0.79.
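As a sketch of how such metrics follow from a fourfold contingency table, the calculation can be reproduced in a few lines of Python. The counts below are hypothetical, chosen only to be consistent with the rounded metrics reported above; they are not taken from the testing protocol.

```python
# Diagnostic metrics from a fourfold (2x2) contingency table.
# The counts used below are hypothetical, not the study's actual tallies.

def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return sensitivity, specificity, and accuracy as fractions."""
    return {
        "sensitivity": tp / (tp + fn),               # true-positive rate
        "specificity": tn / (tn + fp),               # true-negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# 130 pathological and 130 normal examinations (hypothetical split)
metrics = diagnostic_metrics(tp=109, fn=21, tn=96, fp=34)
print({k: round(v, 2) for k, v in metrics.items()})
# → {'sensitivity': 0.84, 'specificity': 0.74, 'accuracy': 0.79}
```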
The input data (brain CT examinations) were processed in the DICOM (Digital Imaging and Communications in Medicine) format. Results included:
- A textual report (DICOM SR)
- Images with annotated pathological regions (DICOM SC)
- A numerical probability score indicating the likelihood of a pathological condition in each examination.
The AI-analysis results were automatically integrated and made available together with the source images in the Unified Radiological Information System (URIS), which is part of the regional state healthcare information system, the Unified Medical Information and Analytical System of Moscow (UMIAS).
In this study, the AI service output was evaluated as a binary classification of the presence or absence of ICH. We did not evaluate the accuracy of classification for individual hemorrhage types, segmentation accuracy, or hemorrhage volume.
Testing of Artificial Intelligence–Based Service
According to the testing and monitoring methods for artificial intelligence–based software developed and validated in the Experiment [15], the AI service underwent a three-stage evaluation before its integration into the core URIS UMIAS for CT scan analysis.
- Technical compatibility of the algorithm with the processed data was successfully verified at the self-testing stage.
- The completeness and adequacy of the AI service tools, as well as the feasibility of performing a diagnostic task, were assessed during functional testing.
- Calibration testing was conducted to evaluate the clinical performance and diagnostic metrics of the service.
Both technical and clinical perspectives of the AI service were assessed at the functional testing stage. The availability and performance of its functionality were evaluated in accordance with the basic diagnostic2 and functional requirements3 developed in the Experiment. The basic diagnostic requirements define the mandatory and optional components of the AI service output, as well as the format of its presentation. For the clinical task of diagnosing ICH based on brain CT, an AI service must provide the following: a numerical probability of hemorrhage; type(s) of hemorrhage (epidural, subdural, subarachnoid, intracerebral); hemorrhage volumes (except for subarachnoid hemorrhage); and segmentation of hyperdense regions on the CT images. The basic functional requirements identify images that the AI service should process, as well as the content-related and formal characteristics of the generated results.
The stage of calibration testing, which followed functional testing, was aimed at confirming or refuting the performance metrics declared by the AI service developer. The numbers of true-positive, true-negative, false-positive, and false-negative results were tabulated in a fourfold contingency table, from which the main evaluation metrics were derived: AUROC, sensitivity, specificity, accuracy, and the proportions of false-negative and false-positive results. Additionally, the minimum, mean, and maximum processing time for a single CT examination were recorded. The following benchmark criteria were adopted for this clinical task:
- AUROC ≥ 0.81
- Time for image acquisition, processing, and transmission of the analysis results ≤ 6.5 minutes
- Proportion of successfully processed scans ≥ 90% [16, 17].
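The benchmark criteria above amount to a simple acceptance check, sketched below. The function name and input fields are hypothetical; the sample values are illustrative (loosely based on the first calibration test reported in the Results: AUROC 0.96, 73 s per study, 9 of 100 scans unprocessed).

```python
# Acceptance check against the Experiment's benchmark criteria for this
# clinical task: AUROC >= 0.81, end-to-end time <= 6.5 min, and >= 90%
# of submitted scans successfully processed. Names are hypothetical.

def meets_benchmarks(auroc: float, turnaround_min: float,
                     processed: int, submitted: int) -> bool:
    """True if all three benchmark criteria are satisfied."""
    return (auroc >= 0.81
            and turnaround_min <= 6.5
            and processed / submitted >= 0.90)

# Illustrative values: AUROC 0.96, 73 s per study, 91 of 100 scans processed
print(meets_benchmarks(auroc=0.96, turnaround_min=73 / 60,
                       processed=91, submitted=100))  # → True
```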
Over an 18-month period, a total of three calibration tests were performed using a dataset containing brain CT scans with a balanced class distribution of 1:1. After each calibration test of the AI service, a protocol was generated that included the service name, type, and vendor, as well as data on processed images, obtained performance metrics, and the decision on whether the AI service met the established benchmark criteria for continued operation within the URIS UMIAS.
Self-testing was performed using publicly available anonymized diagnostic images in DICOM format, accompanied by an Excel file specifying the imaging modality, type of diagnostic procedure, and manufacturer and model of the diagnostic device.
Functional and calibration testing were performed using the MosMedData4 reference dataset. For functional testing, 5 CT scans were used (2 with pathological findings, 2 normal, and 1 with artifacts), whereas for calibration testing, 100 CT scans were used (50 with pathological findings and 50 without).
Clinical Monitoring
On April 28, 2022, the AI service was connected to the real-time processing of brain CT scans from 56 inpatient healthcare institutions. Data on the processing results were collected from April 2022 to September 2023. A total of 191,928 brain CT scans were processed. Each month, 80 CT scans were randomly selected for expert evaluation with a class balance of 70:30 (70% suspected pathological condition and 30% normal findings, according to the AI assessment), in accordance with the developed methodology [18].
Two radiologists specializing in neuroimaging, each with more than 3 years of experience, evaluated these CT studies based on two main criteria:
- Concordance of interpretation (conclusion)
- Concordance of localization (marking of the pathological region).
Each criterion had four possible outcomes:
- Complete concordance
- Partially correct assessment
- False-positive results, when the service detected hemorrhage where none was present
- False-negative results, when the service failed to detect hemorrhage despite its presence.
Ethics Approval
The study design was approved by the Independent Ethics Committee of the Moscow Radiological Society (Extract from Minutes No. 2 of the IEC of MRS RSRR dated February 20, 2020) and registered on ClinicalTrials (NCT04489992).
Statistical Analysis
Receiver operating characteristic (ROC) curve analysis was performed to process the obtained data using a specially developed Web-based tool5. According to empirical evidence, the minimum dataset size required for testing an AI service under periodic monitoring conditions is at least 400 CT scans, with a pathological case proportion of no less than 10% [19]. The actual dataset used in this study exceeded this threshold, comprising 1200 brain CT scans with a pathological condition rate of 48.3%, which was fully aligned with the study objectives. The following diagnostic performance metrics were calculated for the AI service: sensitivity, specificity, accuracy, and AUROC. Given the binary output of the service, AUROC was calculated based on the obtained sensitivity and specificity values. A false-positive result was defined as a case in which the service indicated the presence of ICH in the absence of a pathological condition according to the expert radiologist, whereas a false-negative result was defined as failure of the service to detect ICH when it was confirmed by the expert. All overall diagnostic performance metrics of the AI service presented in the Results section were calculated with 95% confidence intervals (CI) using the binomial test, as the analyzed datasets contained binarized outcomes. To assess the presence and strength of associations between diagnostic performance metrics and service operation time, the Pearson correlation coefficient (r) was used. Diagnostic metrics were compared between calibration testing phases using the Mann–Whitney U test. A one-sided version of the test was applied, with the alternative hypothesis that the median value of a given metric before the third calibration test was lower than after it, as a significant increase in metric values after the third calibration test was anticipated. The significance level for hypothesis testing was set at 0.05.
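A binomial 95% CI for a proportion can be reproduced with an exact (Clopper–Pearson) interval; whether the Web-based tool uses this exact construction is not stated, so the sketch below is an assumption. It inverts the binomial CDF by bisection using only the standard library; the example counts (565 of 580) are illustrative values consistent with the reported overall sensitivity, not the study's tallies.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple:
    """Exact (Clopper-Pearson) binomial confidence interval for k/n."""
    def invert(f, target):
        # f is decreasing in p on [0, 1]; find p with f(p) == target
        lo, hi = 0.0, 1.0
        for _ in range(60):            # bisection
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else invert(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else invert(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

# Illustrative: a sensitivity of ~97.4% observed as 565 of 580 positives
lo, hi = clopper_pearson(565, 580)
print(f"{565 / 580:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```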
RESULTS
Calibration Testing
During the first, second, and third stages of the calibration test, 9, 2, and 1 CT studies, respectively, failed to be processed due to the technical error “failure to send for testing.” ROC curves were constructed based on the correctly processed brain CT studies and are presented in Fig. 1.
Fig. 1. Receiver operating characteristic curves from the calibration tests of the artificial intelligence service designed for automatic analysis of brain computed tomography scans for intracranial hemorrhage: a, first test; b, second test; c, third test.
Table 1 summarizes the numerical metrics obtained in the first, second, and third calibration tests.
Table 1. Performance metrics of the artificial intelligence service designed for automatic analysis of brain computed tomography images for intracranial hemorrhage detection based on three calibration tests
Parameters | Calibration 1 (v. 0.6.1) | Calibration 2 (v. 0.7.1) | Calibration 3 (v. 0.8.3) |
AUROC, % (95% CI) | 96 (92–100) | 98 (96–99) | 96 (91–99) |
Sensitivity, % (95% CI) | 89 (80–98) | 90 (81–98) | 84 (73–94) |
Specificity, % (95% CI) | 96 (90–100) | 98 (94–100) | 98 (94–100) |
Accuracy, % (95% CI) | 92 (87–98) | 94 (89–99) | 91 (85–97) |
Proportion of false-negative results, % | 11 | 10 | 16 |
Proportion of false-positive results, % | 4 | 2 | 2 |
Processing time, s | 73 | 73 | 85 |
Note. Calibrations 1, 2, and 3 correspond to the calibration tests performed at 0, 3, and 7 months of operation; v., artificial intelligence service version; AUROC, area under the receiver operating characteristic curve. |||
The need for repeated calibration was determined by updates to the software, which could potentially degrade its performance metrics. Calibration testing was conducted after each update involving changes to the core of the AI service. However, in all three cases, differences in performance metrics were not significant (p > 0.05).
Clinical Monitoring
Table 2 shows the results of clinical monitoring.
Table 2. Confusion matrix of the artificial intelligence service and performance metrics by month
Month | Se, % | Sp, % | Ac, % | AUROC, % | TP, n | TN, n | FP, n | FN, n |
Calibration 1 | ||||||||
1 | 100 | 39.4 | 50.0 | 93.1 | 14 | 26 | 40 | 0 |
2 | 100 | 46.0 | 57.5 | 94.7 | 17 | 29 | 34 | 0 |
3 | 100 | 42.2 | 53.7 | 99.4 | 16 | 27 | 37 | 0 |
Calibration 2 | ||||||||
4 | 97.5 | 82.5 | 90.0 | 94.6 | 39 | 33 | 7 | 1 |
5 | 93.5 | 61.2 | 73.8 | 86.2 | 29 | 30 | 19 | 2 |
6 | 94.6 | 72.1 | 82.5 | 90.2 | 35 | 31 | 12 | 2 |
7 | 97.1 | 66.7 | 80.0 | 90.0 | 34 | 30 | 15 | 1 |
Calibration 3 | ||||||||
8 | 100 | 75.6 | 87.5 | 93.9 | 39 | 31 | 10 | 0 |
9 | 95.3 | 79.5 | 87.8 | 92.8 | 41 | 31 | 8 | 2 |
10 | 100 | 71.1 | 83.8 | 92.8 | 35 | 32 | 13 | 0 |
11 | 100 | 68.9 | 82.5 | 92.2 | 35 | 31 | 14 | 0 |
12 | 97.6 | 76.9 | 87.5 | 93.1 | 40 | 30 | 9 | 1 |
13 | 100 | 78.0 | 88.8 | 94.5 | 39 | 32 | 9 | 0 |
14 | 97.6 | 73.7 | 86.3 | 92.2 | 41 | 28 | 10 | 1 |
15 | 97.4 | 83.3 | 90.0 | 94.7 | 37 | 35 | 7 | 1 |
16 | 95.6 | 82.9 | 90.0 | 93.8 | 43 | 29 | 6 | 2 |
17 | 94.4 | 77.3 | 85.0 | 91.7 | 34 | 34 | 10 | 2 |
18 | 100 | 88.9 | 95.0 | 97.2 | 44 | 32 | 4 | 0 |
Note. Calibration 1, 2, and 3 correspond to the first, second, and third calibration tests. Cells in gray indicate data excluded from further analysis. Ac, accuracy; AUROC, area under the receiver operating characteristic curve; FN, false negative; FP, false positive; Se, sensitivity; Sp, specificity; TN, true negative; TP, true positive. | ||||||||
The presented data indicate that the performance of the AI service stabilized after the third month. The first 3 months (April–June 2022) were a pilot phase during which the developer refined the solution: processing of imaging findings was unstable, and numerous technical errors affected performance. For this reason, this period was excluded from further analysis.
The overall scheme of the clinical monitoring process for the AI service is shown in Fig. 2.
Fig. 2. Workflow for expert evaluation of brain computed tomography studies during clinical monitoring: FN, false negative; FP, false positive; ICH+, intracranial hemorrhage present; ICH−, intracranial hemorrhage absent; TN, true negative; TP, true positive.
After performance stabilized, a total of 1200 non-contrast brain CT scans were analyzed (see Fig. 2). The mean patient age was 61.2 ± 18.6 years, and 39% of the patients were women. Based on expert assessment (ground truth), signs of ICH were present in 580 CT scans (48.3%). The diagnostic metrics calculated for the entire period from month 4 to month 18 are shown in Fig. 3. According to the results of the binary classification performed by the AI service for detecting ICH, the following performance metrics were obtained:
- sensitivity, 97.4% (95.8–98.5)
- specificity, 75.4% (71.8–78.7)
- accuracy, 86.0% (83.9–87.9)
- AUROC, 94% (92.6–95.3).
Fig. 3. Temporal trends of the diagnostic metrics of the artificial intelligence service relative to the results of two calibration tests. The x-axis represents metric values; the y-axis represents months. Dashed lines indicate the metric values obtained during the calibration tests.
Correlation analysis revealed a significant moderate positive correlation between the duration of system operation and the diagnostic metrics of specificity and accuracy (for both metrics: r = 0.5; p = 0.04), as well as AUROC (r = 0.6; p = 0.03). No significant association was found between sensitivity and the duration of AI service operation.
Comparison of diagnostic metrics between the second and third calibration tests, as well as the subsequent monitoring stage, demonstrated a significant increase only in sensitivity and specificity (p = 0.04).
In addition, refined diagnostic metrics were assessed. For their calculation, we evaluated not only the detection of pathology but also the accuracy of its localization. The concordance between the radiologist’s report and the service output was additionally evaluated. In this analysis, a true-positive result was defined as a case with concordance in both localization and description of the detected condition. Complete agreement between segmentation and description in the presence of ICH was achieved in 28.5% of cases (162 CT scans). Accordingly, discrepancies between the markup and the description were identified in 71.5% (404 CT scans). The refined diagnostic performance metrics were as follows:
- sensitivity, 26.6% (22.9–30.4)
- specificity, 73.8% (70.0–77.4)
- accuracy, 50.1% (47.1–53.0)
- AUROC, 49.6% (44.1–55.1).
Inaccuracies in the description were identified in 61 cases: the AI service correctly detected the presence of the condition on the images but either missed individual hemorrhagic foci in cases of multiple lesions or incorrectly classified the type of hemorrhage. Segmentation inaccuracies were recorded in 64 CT scans and included incorrect delineation of hemorrhagic areas. However, the most common type of inaccuracy involved concurrent errors in both the report and the segmentation (279 CT scans) (see Fig. 2).
One of the most common examples of partially correct detection comprised cases in which, in the presence of multiple hemorrhages, the AI service identified one type of hemorrhage while missing another. Thus, in Fig. 4a, an intracerebral hemorrhage in the left hemisphere was missed, whereas in Fig. 4c, the AI service correctly segmented the intracerebral hemorrhage in the left hemisphere but failed to detect intraventricular hemorrhage in the right hemisphere, as well as interhemispheric subarachnoid hemorrhage and hemorrhage within the cortical sulci of both hemispheres. Another frequent type of incorrect detection was partial segmentation of a hemorrhage with erroneous classification of its type (see Fig. 4c): on the right, a subdural hemorrhage was incorrectly identified by the AI service as subarachnoid. Much less frequently, examples of partially correct detection included completely accurate segmentation of the hemorrhagic lesion with incorrect identification of its type(s) (see Fig. 4b): on the right, an intracerebral hemorrhage was misclassified as subdural, whereas on the left, a subdural hemorrhage was assigned to the subarachnoid type.
Fig. 4. Examples of partially correct outputs generated by the artificial intelligence service: a, correct identification of hemorrhage type with incorrect segmentation; b, correct segmentation of hemorrhagic regions with misclassification of their type; c, partial detection of some hemorrhages with omission of others, with both segmentation and type identification being incorrect.
A total of 152 false-positive results were recorded. Their most common causes were segmentation of major arteries, venous sinuses, and partially calcified meningeal structures (Fig. 5a).
Fig. 5. Examples of false-positive (a) and false-negative (b) outputs generated by the artificial intelligence service.
The AI service failed to detect pathological changes in 14 brain CT images. False-negative results were most frequently observed in cases of subarachnoid hemorrhage, 8 cases (see Fig. 5b, center and right). Less frequently, missed detections included intraventricular hemorrhages, 2 cases (see Fig. 5b, left), intracerebral hemorrhages, 2 cases, and single cases of subdural and epidural hemorrhage.
DISCUSSION
This study is not the first clinical evaluation of AI services designed to detect ICH on non-contrast brain CT. However, its distinguishing features include its considerable duration (18 months) and multicenter design (56 inpatient facilities and 248 radiologists who provided the primary reports).
In our study, a significant increase in the median values of two out of four key diagnostic metrics was observed during clinical monitoring. At the same time, it should be noted that the magnitude of improvement varied. Initially high sensitivity values (median [Me], 95.8%) increased significantly over time (Me, 97.4%; p = 0.04). Specificity, which demonstrated comparatively low baseline values (Me, 69.4%), also showed a statistically significant improvement (Me, 76.2%; p = 0.04). No statistically significant changes were observed for accuracy or AUROC (p = 0.1).
The differences between diagnostic metric values obtained during calibration testing and those observed during clinical monitoring were noteworthy. Calibration metrics were calculated using the Youden index with an optimal operating threshold set at 75%. A significant positive correlation was found between time and specificity, accuracy, and AUROC (r = 0.5–0.6), whereas a significant increase over time was confirmed only for specificity and sensitivity.
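As an illustration of threshold selection by the Youden index, the sketch below scans candidate thresholds and keeps the one maximizing J = sensitivity + specificity − 1. The scores, labels, and candidate thresholds are hypothetical; this is not the calibration tool's actual implementation.

```python
# Operating-threshold selection by maximizing the Youden index
# J = sensitivity + specificity - 1. All data below are hypothetical.

def youden_threshold(scores, labels, thresholds):
    """Return (best_threshold, best_J) over the candidate thresholds.

    Requires both classes to be present in `labels` (otherwise
    sensitivity or specificity is undefined).
    """
    best_t, best_j = None, -1.0
    for t in thresholds:
        tp = sum(s >= t and y for s, y in zip(scores, labels))
        fn = sum(s < t and y for s, y in zip(scores, labels))
        tn = sum(s < t and not y for s, y in zip(scores, labels))
        fp = sum(s >= t and not y for s, y in zip(scores, labels))
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Hypothetical probability scores and ground-truth labels (1 = ICH)
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 1, 0, 0]
best_t, best_j = youden_threshold(scores, labels, [0.25, 0.5, 0.75])
print(best_t, best_j)
```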
ICH is an emergency condition that in most cases requires rapid clinical response and is associated with a high risk of clinical deterioration. Therefore, a key priority is configuring the AI service to maximize sensitivity for detecting a pathology, ensuring that patients with suspected ICH are brought to the radiologist’s attention as early as possible. However, automated threshold adjustment for optimizing diagnostic parameters must be implemented only insofar as it does not result in a critical reduction in specificity.
Several single-center studies have demonstrated high specificity, ranging from 91% to 98%, with lower sensitivity values (81%–94%) [20–23].
Similarly, few multicenter studies comparable in design and sample size to ours have reported relatively lower sensitivity despite high specificity [24–27]. McLouth et al. [24], using the commercially available AI service CINA® v1.0 (Avicenna.ai, France), reported sensitivity and specificity of 91.4% and 97.5%, respectively, in a cohort of 814 patients with a pathology rate of 31%. Comparable performance metrics were described by Kundisch et al. [25], who utilized the commercial AIDOC (Israel) software to detect ICH and obtained sensitivity and specificity values of 87.6% and 92.8%, respectively. The study was conducted on a sample of 4946 patients, with a pathology proportion of 5%. Two recent large-scale studies by Del Gaizo et al. [26] with a sample size of 58,321 (pathology rate, 2.7%) and Pettet et al. [27] with a sample size of 1315 (pathology rate, 8.5%), also using commercial AI systems (CINA® v1.0 [Avicenna.ai, France] and qER® v2.0 [Qure.ai, India]), reported sensitivity and specificity values of 75.6% and 92.1%, and 85.7% and 94.3%, respectively.
According to the systematic review by Mäenpää et al. [28], which assessed the diagnostic accuracy of AI models for emergency interpretation of brain CT scans under external clinical validation conditions, most commercially available AI services demonstrated generally lower sensitivity and positive predictive value, the latter reflecting the proportion of false-positive findings. Such performance indicates weaker generalizability and is suboptimal for triage and worklist-prioritization scenarios that rely on flagging ICH-positive scans, as it carries a higher risk of alert fatigue. In our study, although sensitivity was high, the specificity of 75.4% reflected a large number of false-positive results, suggesting potential additional cognitive load for radiologists. Several studies have attributed reduced predictive value to lower prevalence of the target condition, reflecting the natural influence of baseline condition frequency on performance metrics [23, 29]. The acceptable threshold of false-positive outputs for an AI service operating in a real-time clinical workflow needs further investigation.
Limitations in the use of the full functionality of the AI service (accurate identification of hemorrhage type and localization) are also reflected in the low refined metrics (except for specificity). It should be emphasized, however, that these refined metrics were calculated only for cases in which the AI output fully matched the expert radiologist’s assessment, both in segmentation and hemorrhage type. The overall diagnostic performance metrics were relatively high, with sensitivity in particular exceeding values reported in prior studies.
Several authors have reported excellent agreement in volumetric segmentation of pathological hyperdense regions caused by ICH on brain CT between commercially available AI services and semi-automatic annotation methods. However, Schmitt et al. [30] note that, even with a sensitivity and specificity of 91% and 89%, respectively (in a dataset with a 50% pathology rate), the AI system may function as a second opinion for the radiologist but is not suitable for independent use, a view shared by other investigators. Our findings indicate that full concordance between the AI-generated descriptions and segmentations of pathological regions and the radiologists’ conclusions could not be achieved in even one-third of all pathological cases, largely due to the coexistence of multiple hemorrhage types. This underscores that the AI service may be used as an auxiliary tool for initial detection; however, until substantial improvements are made in segmentation quality and hemorrhage-type classification, its use as a full-fledged clinical decision-support system for detailed characterization of hemorrhage types and their volumes is not justified.
Continuous refinement of AI services and adaptation to evolving clinical environments is both necessary and technically feasible [31]. Using deep convolutional neural networks as a foundation allows these systems to more effectively extract and analyze complex, visually imperceptible imaging features. They interpret these features within a logic that is fundamentally different from human reasoning, a capacity that may indeed improve diagnostic accuracy [26, 32]. Training on new data contributes to better performance over time [33], as demonstrated in our study with an observation period exceeding one year. In addition, regular (preferably monthly) performance monitoring of the AI service on independent clinical data [21] and systematic feedback from radiologists should be implemented. This would enable developers to determine whether additional training data are required and to refine threshold settings, thereby optimizing the balance between sensitivity and accuracy [32].
There is another important reason why continuous clinical monitoring of the AI service's performance is necessary. Our data show that it provides a more objective performance assessment than laboratory testing, even when external validation datasets are used (in our case, calibration testing). Furthermore, the relatively low specificity should prompt clinicians to use the AI service with caution in real-world practice, as insufficient oversight may lead to an increased number of unwarranted hospital admissions or even unnecessary surgical interventions [34]. At the same time, the high sensitivity allows hemorrhage to be ruled out with high certainty in cases of acute ischemic stroke, thereby facilitating assessment of eligibility for thrombolytic therapy [27]. Thus, optimal use of AI by radiologists requires an understanding of the scenarios in which the system is likely to generate inaccurate outputs. In addition, automated estimation of hemorrhage volume by the AI service (if such functionality is available) may serve as an effective means of objectifying segmentation accuracy. This issue should be addressed in future studies.
Study Limitations
Our study has several limitations. First, we did not analyze the diagnostic metrics of the AI service for each specific type of hemorrhage or evaluate the accuracy of the segmented pathological regions. Our retrospective multicenter observational analysis was designed to identify changes in the system's performance over time in actual clinical practice. Second, the dataset used for clinical monitoring was enriched with ICH cases (~50%) and did not reflect the true prevalence of the condition in the general population (~8–12%). This may have contributed to an increased number of false-positive results and to lower specificity compared with the performance metrics reported by the developer. This underscores the need to standardize clinically oriented training and evaluation of AI systems under appropriate conditions. Nevertheless, the persistently high sensitivity of the AI service, even with an increased proportion of cases with the condition in the dataset, indicates a strong capability to detect clinically critical abnormalities, which should be considered an advantage in the context of emergency care.
CONCLUSION
Over the 18-month retrospective monitoring period of an AI service for detecting ICH on non-contrast brain CT across 56 hospitals in Moscow, the system demonstrated promising performance, with very high sensitivity (97.4%) and reasonable specificity (75.4%), both of which improved over time. However, the low refined metrics (sensitivity and accuracy of 26.6% and 50.1%, respectively) indicate substantial discrepancies between radiologists' assessments and AI outputs, driven by incomplete segmentation of pathological regions and misclassification of ICH types. Radiologists should be aware of the operational characteristics of AI systems in clinical practice and recognize that a positive result does not always indicate a true hemorrhage, that a detected hemorrhage may not be the only one present, and that it may not be accurately segmented. Developers of such software must focus on reducing the number of false-positive outputs and improving the performance of the AI service to ensure that its functions are clinically useful. Nonetheless, the current configuration enables hemorrhage exclusion with very high probability, which is particularly valuable for emergency patient triage in admission departments.
ADDITIONAL INFORMATION
Author contributions: A.N. Khoruzhaya: published data search and analysis, AI service testing, monitoring data analysis, writing—original draft, writing—review & editing; K.M. Arzamasov: study conceptualization, organization of AI service testing, monitoring data collection, writing—review & editing; M.R. Kodenko: formal analysis, writing—original draft, writing—review & editing; E.I. Kremneva: published data search and analysis, monitoring data analysis, writing—review & editing; D.V. Burenchev: study conceptualization, published data analysis, writing—review & editing. All the authors approved the version of the manuscript to be published and agreed to be accountable for all aspects of the work, ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Acknowledgments: The authors express their gratitude to Prof. A.V. Petryaykin, Dr. Sci. (Medicine), Chief Researcher at the Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies, for his assistance in the study.
Ethics approval: The study design was approved by the Independent Ethics Committee of the Moscow Radiological Society (Extract from Protocol No. 2 of the IEC of MRO RSRR dated February 20, 2020) and registered on ClinicalTrials (NCT04489992).
Funding sources: This article was part of the research project Scientific Methods for the Sustainable Development of Artificial Intelligence Technologies in Medical Diagnostics (Unified State Information Accounting System No. 123031500004-5).
Disclosure of interests: The authors have no relationships, activities, or interests for the last three years related to for-profit or not-for-profit third parties whose interests may be affected by the content of the article.
Statement of originality: No previously published material (text, images, or data) was used in this work.
Data availability statement: The editorial policy regarding data sharing does not apply to this work.
Generative AI: No generative artificial intelligence technologies were used to prepare this article.
Provenance and peer review: This paper was submitted unsolicited and reviewed following the fast-track procedure. The peer review process involved three external reviewers and the in-house science editor.
1 Artificial Intelligence Technologies in Healthcare. In: Center for Diagnostics and Telemedicine [Internet]. Moscow: Center for Diagnostics and Telemedicine; 2020–2024. Available at: https://mosmed.ai/. Accessed on December 13, 2024.
2 Basic Diagnostic Requirements for AI Service Output [Internet]. In: Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies; 2024. Available at: https://mosmed.ai/ai/docs/. Accessed on December 13, 2024.
3 Basic Functional Requirements for AI Service Output [Internet]. In: Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies; 2024. Available at: https://mosmed.ai/ai/docs/. Accessed on December 13, 2024.
4 Certificate of State Registration of the Database No. 2022620559 / 16.03.2022 Bull. No. 3. Morozov S.P., Pavlov N.A., Petryaykin A.V., et al. MosMedData: A Dataset of Diagnostic Brain Computed Tomography Images With and Without Signs of Intracranial Hemorrhage. Available at: https://www.elibrary.ru/item.asp?id=48137428. Accessed on December 13, 2024.
5 Tool for ROC Analysis of Diagnostic Tests [Internet]. In: Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies; 2022–2024. Available at: https://roc-analysis.mosmed.ai/. Accessed on December 13, 2024.
About the authors
Anna N. Khoruzhaya
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies
Author for correspondence.
Email: KhoruzhayaAN@zdrav.mos.ru
ORCID iD: 0000-0003-4857-5404
SPIN-code: 7948-6427
MD
Russian Federation, Moscow

Kirill M. Arzamasov
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies
Email: ArzamasovK@zdrav.mos.ru
ORCID iD: 0000-0001-7786-0349
SPIN-code: 3160-8062
MD, Dr. Sci. (Medicine)
Russian Federation, Moscow

Maria R. Kodenko
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies
Email: KodenkoM@zdrav.mos.ru
ORCID iD: 0000-0002-0166-3768
SPIN-code: 5789-0319
Cand. Sci. (Engineering)
Russian Federation, Moscow

Elena I. Kremneva
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies; Russian Center of Neurology and Neurosciences
Email: KremnevaE@zdrav.mos.ru
ORCID iD: 0000-0001-9396-6063
SPIN-code: 8799-8092
MD, Dr. Sci. (Medicine)
Russian Federation, Moscow; Moscow

Dmitry V. Burenchev
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies
Email: BurenchevD@zdrav.mos.ru
ORCID iD: 0000-0003-2894-6255
SPIN-code: 2411-3959
MD, Dr. Sci. (Medicine)
Russian Federation, Moscow

References
- Li X, Zhang L, Wolfe CDA, Wang Y. Incidence and long-term survival of spontaneous intracerebral hemorrhage over time: a systematic review and meta-analysis. Frontiers in Neurology. 2022;13:819737. doi: 10.3389/fneur.2022.819737 EDN: MLOQRJ
- Hemorrhagic stroke: clinical guidelines. Moscow: Ministry of Health of the Russian Federation; 2022. (In Russ.) [cited 2024 Dec 12]. Available from: https://ruans.org/Text/Guidelines/hemorrhagic-stroke-2022.pdf
- Hostettler IC, Seiffge DJ, Werring DJ. Intracerebral hemorrhage: an update on diagnosis and treatment. Expert Review of Neurotherapeutics. 2019;19(7):679–694. doi: 10.1080/14737175.2019.1623671 EDN: JWSYUZ
- Woo D, Comeau ME, Venema SU, et al. Risk factors associated with mortality and neurologic disability after intracerebral hemorrhage in a racially and ethnically diverse cohort. JAMA Network Open. 2022;5(3):e221103. doi: 10.1001/jamanetworkopen.2022.1103 EDN: BVHNLU
- Yaghi S, Dibu J, Achi E, et al. Hematoma expansion in spontaneous intracerebral hemorrhage: predictors and outcome. International Journal of Neuroscience. 2014;124(12):890–893. doi: 10.3109/00207454.2014.887716
- Gong B, Khalvati F, Ertl-Wagner BB, Patlas MN. Artificial intelligence in emergency neuroradiology: current applications and perspectives. Diagnostic and Interventional Imaging. 2025;106(4):135–142. doi: 10.1016/j.diii.2024.11.002 EDN: DHXSGS
- Arbabshirani MR, Fornwalt BK, Mongelluzzo GJ, et al. Advanced machine learning in action: identification of intracranial hemorrhage on computed tomography scans of the head with clinical workflow integration. npj Digital Medicine. 2018;1(1):9. doi: 10.1038/s41746-017-0015-z EDN: BORIWC
- Seyam M, Weikert T, Sauter A, et al. Utilization of artificial intelligence–based intracranial hemorrhage detection on emergent noncontrast CT images in clinical workflow. Radiology: Artificial Intelligence. 2022;4(2):e210168. doi: 10.1148/ryai.210168 EDN: HEPSBX
- Davis MA, Rao B, Cedeno PA, et al. Machine learning and improved quality metrics in acute intracranial hemorrhage by noncontrast computed tomography. Current Problems in Diagnostic Radiology. 2022;51(4):556–561. doi: 10.1067/j.cpradiol.2020.10.007 EDN: NHQFYC
- O’Neill TJ, Xi Y, Stehel E, et al. Active reprioritization of the reading worklist using artificial intelligence has a beneficial effect on the turnaround time for interpretation of head CT with intracranial hemorrhage. Radiology: Artificial Intelligence. 2021;3(2):e200024. doi: 10.1148/ryai.2020200024 EDN: LCDGTM
- Smorchkova AK, Khoruzhaya AN, Kremneva EI, Petryaikin AV. Machine learning technologies in CT-based diagnostics and classification of intracranial hemorrhages. Burdenko's Journal of Neurosurgery. 2023;87(2):85. doi: 10.17116/neiro20238702185 EDN: JVZDST
- Yu KH, Kohane IS. Framing the challenges of artificial intelligence in medicine. BMJ Quality & Safety. 2018;28(3):238–241. doi: 10.1136/bmjqs-2018-008551
- Allen B, Dreyer K, Stibolt R, et al. Evaluation and real-world performance monitoring of artificial intelligence models in clinical practice: try it, buy it, check it. Journal of the American College of Radiology. 2021;18(11):1489–1496. doi: 10.1016/j.jacr.2021.08.022 EDN: NMKGVD
- Recht MP, Dewey M, Dreyer K, et al. Integrating artificial intelligence into the clinical practice of radiology: challenges and recommendations. European Radiology. 2020;30(6):3576–3584. doi: 10.1007/s00330-020-06672-5 EDN: WWDEXB
- Vasiliev YuA, Vlazimirskyy AV, Omelyanskaya OV, et al. Methodology for testing and monitoring artificial intelligence-based software for medical diagnostics. Digital Diagnostics. 2023;4(3):252–267. doi: 10.17816/DD321971 EDN: UEDORU
- Morozov SP, Vladzimirsky AV, Klyashtornyy VG, et al. Clinical acceptance of software based on artificial intelligence technologies (radiology). Moscow: Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies; 2019. EDN: GWJIMI
- Morozov SP, Vladzimirsky AV, Andreychenko AE, et al. Regulations for the preparation of data sets with a description of approaches to the formation of a representative data sample. Moscow: Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies; 2022. (In Russ.) EDN: XENAJE
- Chetverikov SF, Arzamasov KM, Andreichenko AE, et al. Approaches to sampling for quality control of artificial intelligence in biomedical research. Sovremennye tehnologii v medicine. 2023;15(2):19. doi: 10.17691/stm2023.15.2.02 EDN: FUKXYC
- Kodenko MR, Bobrovskaya TM, Reshetnikov RV, et al. Empirical approach to sample size estimation for testing of AI algorithms. Doklady Mathematics. 2024;110(S1):S62–S74. doi: 10.1134/S1064562424602063 EDN: VJHJRD
- Salehinejad H, Kitamura J, Ditkofsky N, et al. A real-world demonstration of machine learning generalizability in the detection of intracranial hemorrhage on head computerized tomography. Scientific Reports. 2021;11(1):17051. doi: 10.1038/s41598-021-95533-2 EDN: SXLMCH
- Zia A, Fletcher C, Bigwood S, et al. Retrospective analysis and prospective validation of an AI-based software for intracranial haemorrhage detection at a high-volume trauma centre. Scientific Reports. 2022;12(1):19885. doi: 10.1038/s41598-022-24504-y EDN: IWNBET
- Ginat DT. Analysis of head CT scans flagged by deep learning software for acute intracranial hemorrhage. Neuroradiology. 2019;62(3):335–340. doi: 10.1007/s00234-019-02330-w EDN: WTOITQ
- Voter AF, Meram E, Garrett JW, Yu JPJ. Diagnostic accuracy and failure mode analysis of a deep learning algorithm for the detection of intracranial hemorrhage. Journal of the American College of Radiology. 2021;18(8):1143–1152. doi: 10.1016/j.jacr.2021.03.005 EDN: GPJYDS
- McLouth J, Elstrott S, Chaibi Y, et al. Validation of a deep learning tool in the detection of intracranial hemorrhage and large vessel occlusion. Frontiers in Neurology. 2021;12:656112. doi: 10.3389/fneur.2021.656112 EDN: FFIXVV
- Kundisch A, Hönning A, Mutze S, et al. Deep learning algorithm in detecting intracranial hemorrhages on emergency computed tomographies. PLOS ONE. 2021;16(11):e0260560. doi: 10.1371/journal.pone.0260560 EDN: QPACKZ
- Del Gaizo AJ, Osborne TF, Shahoumian T, Sherrier R. Deep learning to detect intracranial hemorrhage in a national teleradiology program and the impact on interpretation time. Radiology: Artificial Intelligence. 2024;6(5):e240067. doi: 10.1148/ryai.240067 EDN: EHHAOO
- Pettet G, West J, Robert D, et al. A retrospective audit of an artificial intelligence software for the detection of intracranial haemorrhage used by a teleradiology company in the United Kingdom. BJR|Open. 2023;6(1):tzae033. doi: 10.1093/bjro/tzae033 EDN: DWNYCF
- Mäenpää SM, Korja M. Diagnostic test accuracy of externally validated convolutional neural network (CNN) artificial intelligence (AI) models for emergency head CT scans – A systematic review. International Journal of Medical Informatics. 2024;189:105523. doi: 10.1016/j.ijmedinf.2024.105523 EDN: HLVVYQ
- Eldaya RW, Kansagra AP, Zei M, et al. Performance of automated RAPID intracranial hemorrhage detection in real-world practice: a single-institution experience. Journal of Computer Assisted Tomography. 2022;46(5):770–774. doi: 10.1097/rct.0000000000001335 EDN: GRDZTF
- Schmitt N, Mokli Y, Weyland CS, et al. Automated detection and segmentation of intracranial hemorrhage suspect hyperdensities in non-contrast-enhanced CT scans of acute stroke patients. European Radiology. 2021;32(4):2246–2254. doi: 10.1007/s00330-021-08352-4 EDN: OLFWXI
- Warman R, Warman A, Warman P, et al. Deep learning system boosts radiologist detection of intracranial hemorrhage. Cureus. 2022. doi: 10.7759/cureus.30264 EDN: IRZKDY
- Buchlak QD, Tang CHM, Seah JCY, et al. Effects of a comprehensive brain computed tomography deep learning model on radiologist detection accuracy. European Radiology. 2023;34(2):810–822. doi: 10.1007/s00330-023-10074-8 EDN: ZHIFOG
- Ngiam KY, Khor IW. Big data and machine learning algorithms for health-care delivery. The Lancet Oncology. 2019;20(5):e262–e273. doi: 10.1016/S1470-2045(19)30149-4
- Kiefer J, Kopp M, Ruettinger T, et al. Diagnostic accuracy and performance analysis of a scanner-integrated artificial intelligence model for the detection of intracranial hemorrhages in a traumatology emergency department. Bioengineering. 2023;10(12):1362. doi: 10.3390/bioengineering10121362 EDN: EPLIBY