Learning radiologists’ annotation styles with multi-annotator labeling for improved neural network performance

Abstract

BACKGROUND: One of the common problems in labeling medical images is inter-observer variability: the same image can be labeled differently by different doctors. The main reasons are human factors, differences in experience and qualifications, different “radiology schools”, poor image quality, and unclear instructions. The influence of some of these factors can be reduced by proper organization of the annotation process; nevertheless, doctors’ opinions frequently differ.

AIM: The study aimed to test whether a neural network with an additional module can learn the labeling styles and habits of different radiologists, and whether such modeling can improve the final object detection metrics on radiological images.

METHODS: For training artificial intelligence systems in radiology, cross-labeling, i.e., annotation of the same image by several doctors, is frequently used. The simplest approach is to treat each doctor’s labeling as an independent example when training the model. Other methods combine the annotations by rules or algorithms before training. Finally, Guan et al. used separate classification heads to model the labeling styles of individual doctors; unfortunately, this method does not extend to more complex tasks, such as detecting objects on an image. For this analysis, a machine learning model designed to detect objects of different classes on mammographic scans was used. The model is a neural network based on the Deformable DETR architecture. A dataset of 7,756 mammographic breast scans with 12,543 unique annotations from 19 doctors was used to train the network. Datasets of 700 and 300 BI-RADS-labeled scans were used for validation and testing, respectively. In all datasets, the proportion of images with pathology was in the 15%–20% range. Each of the 19 doctors was assigned a unique index, and at each iteration of neural network training a special module looked up the vector corresponding to that index. The vector was broadcast to the spatial size of each level of the feature pyramid and concatenated to the maps as additional channels, so that the encoder and the decoder of the detector had access to the information about which doctor had labeled the scan. The vectors were updated by back-propagation (a minimal sketch of this module is given after the list below). Three methods were chosen for comparison:

  1. Basic model: labels from different doctors were combined using the “voting” method.
  2. New stylistic module: predictions on the test dataset were made using the index of the single doctor with the best metrics on the validation dataset.
  3. New stylistic module: the indexes of the five doctors with the best metrics on the validation dataset were used for predictions on the test dataset, and Weighted Boxes Fusion was used to combine the per-doctor predictions (sketched after the metric description below).
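The conditioning module described above can be pictured as follows. This is a minimal PyTorch sketch under our own assumptions (the embedding size, class and layer names, and how the detector consumes the extra channels are not specified in the abstract), not the authors’ exact implementation:

```python
import torch
import torch.nn as nn

class DoctorStyleModule(nn.Module):
    """Learns one style vector per annotating doctor and attaches it,
    broadcast spatially, to every level of the feature pyramid."""

    def __init__(self, num_doctors: int = 19, embed_dim: int = 16):
        super().__init__()
        # One trainable vector per doctor, updated by back-propagation.
        self.embedding = nn.Embedding(num_doctors, embed_dim)

    def forward(self, feature_maps: list[torch.Tensor],
                doctor_idx: torch.Tensor) -> list[torch.Tensor]:
        # feature_maps: tensors of shape (B, C, H, W), one per pyramid level.
        # doctor_idx: (B,) tensor with the index of the labeling doctor.
        vec = self.embedding(doctor_idx)                 # (B, embed_dim)
        conditioned = []
        for fmap in feature_maps:
            b, _, h, w = fmap.shape
            # Expand the style vector to the spatial size of this level
            # and concatenate it to the feature map as extra channels.
            style = vec.view(b, -1, 1, 1).expand(-1, -1, h, w)
            conditioned.append(torch.cat([fmap, style], dim=1))
        return conditioned
```

At inference time the same module is queried with a chosen doctor’s index (methods 2 and 3 above). The downstream encoder must then accept C + embed_dim input channels; alternatively, a 1×1 convolution could project the concatenated maps back to C channels, though the abstract does not say which variant was used.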

The area under the receiver operating characteristic curve (ROC-AUC) on the test dataset was used as the primary metric (BI-RADS categories 3, 4, and 5 were treated as pathology). For each method, the probability of malignancy was taken as the sum, over the craniocaudal and mediolateral oblique projections, of the maximum probabilities of detected malignant objects (malignant masses and calcifications).
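For method 3 and the scoring rule above, a sketch along the following lines is plausible. It uses the reference Weighted Boxes Fusion implementation from the `ensemble-boxes` package by Solovyev et al. (reference 5 below); the malignant class ids and the 0.55 IoU threshold are our assumptions, not reported values:

```python
from ensemble_boxes import weighted_boxes_fusion
from sklearn.metrics import roc_auc_score

MALIGNANT_CLASSES = {1, 3}  # hypothetical ids: malignant mass, calcification

def fuse_predictions(boxes_list, scores_list, labels_list):
    """Combine per-doctor predictions for one image with WBF.
    Each list holds one entry per selected doctor embedding (five here),
    with box coordinates normalized to [0, 1], as WBF expects."""
    return weighted_boxes_fusion(
        boxes_list, scores_list, labels_list,
        iou_thr=0.55,      # assumed; the abstract does not report it
        skip_box_thr=0.0,
    )

def malignancy_score(cc, mlo):
    """Probability of malignancy: the sum over the craniocaudal (cc) and
    mediolateral oblique (mlo) projections of the maximum probability
    among detected malignant objects. cc and mlo are (scores, labels)."""
    def max_malignant(scores, labels):
        return max((s for s, l in zip(scores, labels)
                    if l in MALIGNANT_CLASSES), default=0.0)
    return max_malignant(*cc) + max_malignant(*mlo)

# Test-set metric, with BI-RADS 3-5 as the positive class:
# auc = roc_auc_score(y_true, [malignancy_score(cc, mlo) for cc, mlo in exams])
```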

RESULTS: The following ROC-AUC values were obtained for the three methods: 0.82, 0.87, and 0.89, respectively.

CONCLUSIONS: Information about the labeling doctor allows the neural network to learn and model the labeling styles of different doctors more effectively. In addition, this method may provide an estimate of the uncertainty of the network’s predictions: if embeddings of different doctors lead to different predictions, the image is likely difficult for the artificial intelligence system to process (a possible implementation of this idea is sketched below).
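As an illustration of the uncertainty idea, one could score the same image under every doctor embedding and use the spread of the results as a difficulty signal. The model interface below (a `doctor_idx` argument and a scalar malignancy score in the output) is hypothetical:

```python
import torch

@torch.no_grad()
def embedding_disagreement(model, image, num_doctors: int = 19) -> float:
    """Run the detector once per doctor embedding and return the standard
    deviation of the resulting malignancy scores; a large spread suggests
    the case is hard for the model."""
    scores = []
    for idx in range(num_doctors):
        preds = model(image, doctor_idx=torch.tensor([idx]))  # assumed API
        scores.append(float(preds["malignancy_score"]))       # assumed key
    return float(torch.tensor(scores).std())
```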


About the authors

Evgeniy D. Nikitin

Medical Screening Systems LLC

Author for correspondence.
Email: e.nikitin@celsus.ai
ORCID iD: 0000-0001-7181-1036
Russian Federation, Saint Petersburg

References

  1. Jensen MH, Jørgensen DR, Jalaboi R, Hansen ME, Olsen MA. Improving uncertainty estimation in convolutional neural networks using inter-rater agreement. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. Vol. 11767. Cham: Springer; 2019. P. 540–548. doi: 10.1007/978-3-030-32251-9_59
  2. Jungo A, Meier R, Ermis E, et al. On the effect of inter-observer variability for a reliable estimation of uncertainty of medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. Vol. 11070. Cham: Springer; 2018. P. 682–690. doi: 10.1007/978-3-030-00928-1_77
  3. Guan MY, Gulshan V, Dai AM, Hinton GE. Who said what: modeling individual labelers improves classification. Proceedings of the AAAI Conference on Artificial Intelligence. 2018;32(1). doi: 10.1609/aaai.v32i1.11756
  4. Zhu X, Su W, Lu L, et al. Deformable DETR: deformable transformers for end-to-end object detection. arXiv:2010.04159; 2020. doi: 10.48550/arXiv.2010.04159
  5. Solovyev R, Wang W, Gabruseva T. Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing. 2021;107:104117. doi: 10.1016/j.imavis.2021.104117


Copyright (c) 2023 Eco-Vector

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
