MosMedData: data set of 1110 chest CT scans performed during the COVID-19 epidemic

Sergey P. Morozov; Морозов Сергей Павлович; Anna E. Andreychenko; Андрейченко Анна Евгеньевна; Ivan A. Blokhin; Блохин Иван Андреевич; Pavel B. Gelezhe; Гележе Павел Борисович; Anna P. Gonchar; Гончар Анна Павловна; Alexander E. Nikolaev; Николаев Александр Евгеньевич; Nikolay A. Pavlov; Павлов Николай Александрович; Valeria Yu. Chernina; Чернина Валерия Юрьевна; Victor A. Gombolevskiy; Гомболевский Виктор Александрович

doi:10.17816/DD46826


Authors: Morozov S.P.¹, Andreychenko A.E.¹, Blokhin I.A.¹, Gelezhe P.B.¹, Gonchar A.P.¹, Nikolaev A.E.¹, Pavlov N.A.¹, Chernina V.Y.¹, Gombolevskiy V.A.¹
Affiliations:
1. Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies, Department of Health Care of Moscow
Issue: Vol 1, No 1 (2020)
Pages: 49-59
Section: Datasets
Submitted: 12.10.2020
Accepted: 11.12.2020
Published: 30.12.2020
URL: https://jdigitaldiagnostics.com/DD/article/view/46826
DOI: https://doi.org/10.17816/DD46826
ID: 46826

Abstract

With the ongoing COVID-19 pandemic, the decreasing availability of reverse transcription polymerase chain reaction (RT-PCR) testing, and the rapid growth of medical imaging, especially the number of chest computed tomography (CT) scans being performed, methods that augment and automate image analysis, increase productivity, and minimize human error are of particular importance. The creation of high-quality datasets is essential for the development and validation of artificial intelligence algorithms. Such technologies have sufficient accuracy in diagnosing COVID-19 in medical imaging. The presented large-scale dataset contains anonymized human CT scans with COVID-19 features as well as normal studies. Some studies were annotated by radiologists using binary pixel masks of regions of interest (e.g., characteristic areas of consolidation and ground-glass opacities). CT data were acquired between March 1, 2020, and April 25, 2020, and provided by municipal hospitals in Moscow, Russia. The presented dataset is licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0).

Keywords

artificial intelligence, COVID-19, machine learning, dataset, CT, chest

Full text

BACKGROUND

During the COVID-19 pandemic, most countries encountered a huge increase in the burden on their health care systems. More than ever, this situation required the careful use of financial and human resources. Unfortunately, the preventive measures taken in health facilities are not always sufficient to avoid the loss of health workers. The loss of trained specialists in emergency care, radiology, and other fields is of particular concern. Computed tomography (CT) is considered the key tool for diagnosing pneumonia and assessing its progression. In outpatient settings, CT is intended for patients with acute respiratory symptoms, as well as for those initially diagnosed with viral pneumonia who require follow-up and are capable of recovering at home (under observation using telemedicine technologies).

In in-patient facilities, CT is used for making the primary and differential diagnosis, assessing disease progression, and determining whether a patient should be admitted to the intensive care unit or discharged [1, 3, 4]. The increasing use of CT is placing a heavy burden on the health care system. For example, in Moscow, the network of municipal outpatient CT centers is conducting approximately 90 examinations per CT scanner per day (with up to 163 examinations per day). Therefore, to standardize and streamline clinical decision-making, specialists developed a classification model that, along with other symptoms, evaluates the severity of pulmonary tissue anomalies observed on CT scans (see Table 2). This classification, based on the volume of pulmonary parenchyma lesions on chest CT, makes it possible to predict lethal outcomes in COVID-19 [9]. Professional burnout and the high risk of death among health professionals call for the automation of image analysis, which will increase productivity and minimize errors [8]. Preliminary data show that artificial intelligence (AI) algorithms have sufficient accuracy for diagnosing COVID-19 (sensitivity: 90%, specificity: 96%, AUC: 0.96, overall accuracy: 76.37–98.26%) [6, 10].

MATERIALS AND METHODS

Chest CT was performed on 42 CT scanners of the same model, Toshiba Aquilion 64 (Canon Medical Systems, Japan). All examinations were performed according to the standard methods and protocols recommended by the manufacturer (Table 1).

One examination refers to a single patient and includes one three-dimensional reconstruction. The inclusion criteria were as follows: a patient visit to an outpatient clinic reorganized as an Outpatient Computed Tomography Center during the pandemic, and referral for a chest CT from a general practitioner under obligatory health insurance.

Table 1. Methods of scanning, reconstructing images, and saving the database

Parameter set	Feature	Meaning and comment
Equipment	CT-scanner	Toshiba Aquilion 64 (Canon Medical Systems, Japan)
Equipment	Number of slices	64
Patients	Patient positioning	Gantry centered at the thorax. Table height and alignment are adjusted such that the middle clavicular line is in the isocenter. Hands above the head. Instructions for breathing: patient education and breathing instruction before scanning.
Patients	Clothing and foreign objects	All foreign objects should be removed from the scan area, including jewelry and chains around the neck. Underwear is acceptable.
Patients	Localizer/scout	Conducted at the chest level to limit the scanning to the lung range. Also used to find additional foreign objects at the scan level that could impair the image quality. Breath-hold scan at breathing depth.
	Scanning range	The entire volume of the lungs, including 5 cm above and 5 cm below the lungs.
	Breathing phase	CT scan with breath-holding at inspiration depth.
	Field of view (FOV)	At least 1 cm beyond the ribs (from 350 to 500 mm). The breasts were included in the scanning area but could be partially excluded from the field of view.
Medical staff	Technician	The technician stayed in the control room and was not in contact with the patient. Face-to-face contact with the positioning assistant was minimized for safety reasons.
Medical staff	Stacker	The positioning assistant is a medical officer of the Radiology Department who was transferred from the mammography X-ray technicians to the CT room as additional personnel during the epidemic, according to the order of the Moscow Department of Health. The assistant was located in the scanner room (assisting with patient and table positioning) and in the corridor (during scanning), and was in contact with the patient.
Scanning protocol and image reconstruction, viewing, and interpretation	Gantry tilt	no
	Scan duration	≤ 10 seconds
	Contrast enhancement	no
	Oral contrast	no
	Voltage	120 kV
	Current	Automatic power modulation system «Sure exp.3D», built into the scanner by the manufacturer. The system automatically adjusted the tube current within the range of 80–500 mA to achieve a noise level of 10 HU for 5.0 mm-thick slices. XY modulation: on.
	Rotation speed	0.5 s
	Pitch	95.0
	Recon process	QDS+
Scanning protocol and image reconstruction, viewing, and interpretation	Number of CT series reconstructed	2 (with pulmonary and soft tissue kernel)
	Convolution kernel for soft tissues	FC07 or FC18
	Convolution kernel for lungs	FC51
	Slice thickness	1.0 mm (same for both kernels)
	Increment	0.8 mm (same for both kernels)
	Iterative reconstruction	AIDR 3D was available on only 5 scanners; the rest had no iterative reconstruction algorithms and used FBP (filtered back projection).
	Software used for CT interpretation	AGFA Enterprise 8.0, Vitrea FX
	Maximum Intensity Projections (MIP), Minimum Intensity Projections (MinIP), and Multiplanar Reconstruction (MPR)	Maximum Intensity Projections (MIP), Minimum Intensity Projections (MinIP), and Multiplanar Reconstruction (MPR) were used
	Artificial Intelligence Algorithms	They were used, but not for all examinations. In the case of machine learning, algorithms created an additional image series for the radiologist, highlighting the COVID-19 lung lesion. COVID-19 was shown as red rectangles, attracting the attention of the doctor. In addition, a summarized three-dimensional reconstruction of the lungs with red regions of interest was available. Quantitative information to estimate the degree of lung damage was not presented.
	Report turnaround time	From 10 min to 3 hours; in rare cases, 24 hours.
	Protocol standardization	The structured report template was formed and regulated in the methodical recommendations, as well as implemented in the Unified Radiological Information Service, used for study reporting in the outpatient clinics.
	COVID-19 classification	Classification by the CT0-CT4 scale was used (see Table 2).
	Second opinion	For 90% of all CT examinations from outpatient clinics, a second reading was performed.
	Effective dose calculation	DLP data from the automatically created DoseReport CT series were used. In the Russian Federation, according to the methodological guidelines (MU 2.6.1.2944-11) «Control of Effective Patient Doses during Medical Radiology», the effective dose is calculated by multiplying DLP by 0.017 (anatomic location-based index).
Dataset	Data acquisition	Unified Radiological Information Service, including AGFA Enterprise 8.0
	Initial data collection format	DICOM 3.0
	Plane	Axial
Database	Slice thickness	1.0 mm
	Increment	8.0 mm (as every 10th slice is saved)
	Export file extension	NIfTI
	Annotation software in the form of binary masks with lung lesions	MedSeg® (© 2020 Artificial Intelligence AS)

Notes: CT, computed tomography; CT-1 to CT-4, the degree of lung damage based on CT results; RR, respiratory rate; FiO₂, fraction of inspired oxygen; SpO₂, blood oxygen saturation.
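
The effective dose row of Table 1 describes a single multiplication, which can be sketched as follows. This is an illustration only: the function name and the example DLP value are hypothetical, and the units (DLP in mGy·cm, result in mSv) are the conventional ones for this kind of coefficient and are assumed here, since the table does not state them.

```python
# Conversion coefficient for the chest quoted in Table 1.
# Assumed units: mSv per mGy*cm (not stated in the table).
CHEST_DLP_COEFFICIENT = 0.017


def effective_dose_msv(dlp_mgy_cm: float, k: float = CHEST_DLP_COEFFICIENT) -> float:
    """Return the effective dose (mSv) for a dose-length product (mGy*cm)."""
    if dlp_mgy_cm < 0:
        raise ValueError("DLP must be non-negative")
    return dlp_mgy_cm * k


# Hypothetical DLP value, not taken from the dataset.
example_dlp = 200.0
print(f"{effective_dose_msv(example_dlp):.2f} mSv")
```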

The exclusion criteria were pregnancy and age under 18 years. Patients whose blood oxygen saturation was found to be below 93% before the CT scan were removed from the study and transported to a hospital by the ambulance service.

The dataset was developed in five stages as discussed below.

DATA COLLECTION

Data collection was conducted in the period from March 1 to April 25, 2020 in the outpatient clinics of Moscow City Health Care (Table 3).

Table 2. Lung lesion grading in COVID-19 and routing rules

Severity	CT category	Clinical Data	Decision
Zero	CT-0 Not consistent with pneumonia (including COVID-19).	–	Inform the attending physician. Refer to a specialist.
Mild	CT-1 Ground-glass opacities. Pulmonary parenchymal involvement ≤25% OR absence of CT signs in the presence of typical clinical manifestations and relevant epidemiological history.	A. t <38.0 °C B. RR <20/min C. SpO₂ >95%	Follow-up at home using telemedicine technologies (mandatory telemonitoring)
Moderate	CT-2 Ground-glass opacities. Pulmonary parenchymal involvement 25–50%	A. t >38.5 °C B. RR 20–30/min C. SpO₂ 95%	Follow-up at home by a primary care physician
Severe	CT-3 Ground-glass opacities. Pulmonary consolidation. Pulmonary parenchymal involvement of 50–75%. Lung involvement increased in 24–48 hours by 50% with respiratory impairment per the follow-up studies.	One or more signs on the background of fever: A. t >38.5 °C B. RR ≥30/min C. SpO₂ ≤95% D. Partial pressure of oxygen (PaO₂)/Fraction of inspired oxygen (FiO₂) ≤300 mmHg (1 mmHg = 0.133 kPa)	Immediate admission to a COVID specialized hospital. In a hospital setting: immediate transfer to the intensive care and resuscitation unit. Emergency computed tomography (if not done before).
Critical	CT-4 Diffuse ground-glass opacities with consolidations and reticular changes. Hydrothorax (bilateral, more on the left). Pulmonary parenchymal involvement ≥75%.	Signs of shock, multiple organ failure, and respiratory failure.	Emergency medical care. Immediate admission to a specialized hospital for patients diagnosed with COVID-19. In a hospital setting: immediate transfer to the intensive care and resuscitation unit. Emergency computed tomography (if not done before and when patient status allows for it).
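
The involvement thresholds in Table 2 can be expressed as a small lookup, sketched below. It is a simplification, not the clinical routing rule: it uses only the percentage of parenchymal involvement and ignores the clinical data column, including the case where CT-1 is assigned without CT signs. The function name is hypothetical, and the handling of the shared boundaries (25% and 50% assigned to the lower category, 75% to CT-4) is an assumption, because the table ranges overlap at those values.

```python
def ct_category(involvement_percent: float) -> str:
    """Map the percentage of parenchymal involvement to a CT-0..CT-4 label.

    Boundary handling is an assumption: 25 -> CT-1, 50 -> CT-2, 75 -> CT-4.
    """
    if not 0 <= involvement_percent <= 100:
        raise ValueError("involvement must be between 0 and 100")
    if involvement_percent == 0:
        return "CT-0"
    if involvement_percent <= 25:
        return "CT-1"
    if involvement_percent <= 50:
        return "CT-2"
    if involvement_percent < 75:
        return "CT-3"
    return "CT-4"


# Hypothetical involvement values for illustration.
for value in (0, 10, 40, 60, 80):
    print(value, ct_category(value))
```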

Table 3. List of medical organizations where CT data were collected

Municipal Hospital (MH) № 19 Department of Health Care of Moscow	MH № 214	MH № 52
MH № 23	MH № 6	Diagnostic Center № 5
MH № 3	MH № 209	MH № 9
MH № 62	Diagnostic Center № 4	MH № 218
MH № 175	MH № 212	MH № 170
MH № 191	MH № 8	M. P. Conchalovsky hospital (outpatient and in-patient care)
MH № 195	MH № 64	MH № 134
MH № 115	Pediatric Diagnostic Center № 1	MH № 67
Diagnostic Center № 121	MH № 36	MH № 68
Diagnostic Center № 2	MH № 11	MH № 180
MH № 45	MH № 5	MH № 5
MH № 2	Moscow Research and Practical Center for Tuberculosis Control of the South-East Moscow District	MH № 46
MH № 166	Moscow Research and Practical Center for Tuberculosis Control of the Central and West Moscow Districts	MH № 12
MH № 220	MH № 66	Diagnostic Center № 3

This dataset (1110 studies) contains anonymized human lung CT scans with signs of COVID-19 (CT1-CT4) and without signs of COVID-19 (CT0) (Figure 1). Sample characteristics: 1110 individuals, of whom 42% were male, 56% female, and 2% other/unknown; age range 18 to 97 years, median age 47 years.

Figure 1: The order of forming a dataset.

Note: CT — computed tomography.

Figure 2: Examples of chest CT scans of patients with varying degrees of COVID-19 severity. Left to right, upper row: axial CT slices of patients with COVID-19 from mild (CT-1) to critical (CT-4) severity. Left to right, lower row: same CT data after tagging.

Figure 3: Data storage structure in the dataset.

At the first stage, all the examinations (n=1110) were distributed into five categories according to the classification (Table 2). The number of cases by category was as follows: CT-0, 254 (22.8%); CT-1, 684 (61.6%); CT-2, 125 (11.3%); CT-3, 45 (4.1%); and CT-4, 2 (0.2%). At the second stage, each study was saved in the NIfTI format and compressed with Gzip. During this process, only every 10th image (Instance) was saved in the final study file.
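
The effect of keeping every 10th image can be illustrated with a short sketch. The series length of 400 slices is hypothetical, and the sketch assumes that the first image of the series is the first one kept, which the text does not specify. With the 0.8 mm reconstruction increment from Table 1, the spacing between saved slices comes out at the 8.0 mm listed for the database.

```python
def keep_every_nth(slices, n=10):
    """Return every n-th element of a slice sequence, starting with the first."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return slices[::n]


# Hypothetical reconstructed series: 400 slices at a 0.8 mm increment.
reconstruction_increment_mm = 0.8
slice_positions_mm = [i * reconstruction_increment_mm for i in range(400)]

kept = keep_every_nth(slice_positions_mm, n=10)
saved_increment_mm = round(kept[1] - kept[0], 3)
print(len(kept), "slices kept, spaced", saved_increment_mm, "mm apart")
```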

A small number of the CT scans (n = 50) were annotated by specialists from the Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Department of Health. During the markup, positive (white) pixels on the corresponding binary pixel mask were selected for each of the images. The obtained masks were saved in NIfTI format and then compressed with Gzip. MedSeg® annotation software (© 2020 Artificial Intelligence AS) was used to create the binary masks.

This software was used to tag only COVID-19 lesions, including ground-glass opacities, consolidation, small vessels, and bronchioles. The density thresholds for tagging ranged from −700 HU to −130 HU, but they could differ depending on the breathing depth. We excluded large vessels and bronchi, visually unchanged pulmonary parenchyma, motion artifacts (respiratory, due to cough and respiratory failure), gravitational changes (if it was possible to reliably differentiate them), calcifications, and pleural effusion.
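
As a rough illustration of the density window mentioned above, the sketch below marks pixels whose value lies between −700 HU and −130 HU. It is not the annotation procedure itself: the masks in the dataset were drawn by specialists in MedSeg with the exclusions listed above, whereas a plain threshold would also select vessels, artifacts, and other excluded structures. The function name, the inclusive bounds, and the toy pixel values are assumptions.

```python
HU_LOW, HU_HIGH = -700, -130  # density window quoted in the text


def threshold_mask(slice_hu, low=HU_LOW, high=HU_HIGH):
    """Return a binary mask (1 inside the HU window, 0 outside) for a 2D slice."""
    return [[1 if low <= value <= high else 0 for value in row] for row in slice_hu]


# Toy 2x3 "slice" with hypothetical HU values.
toy_slice = [
    [-1000, -850, -600],
    [-400, -130, 40],
]
mask = threshold_mask(toy_slice)
for row in mask:
    print(row)
```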

All chest CT scans used in the dataset passed an independent external audit by radiologists from the Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Department of Health, whose opinion was accepted as final in assessing the severity of COVID-19 lung damage according to the adopted classification (CT0-CT4). These data were available in the Unified Radiological Information Service (URIS) in a structured form and were used to compile the final table of assessment results. Thus, all the studies were evaluated by at least two specialists. In addition, 50 studies were evaluated by three specialists, as they were annotated using the external MedSeg software.

The data set is intended for training, calibration, and the independent evaluation of AI algorithms (computer vision) [7]. COVID-19 AI algorithms (computer vision) can help in the fight against this disease in the following ways:

Examine patients in outpatient facilities for fast and consistent routing (including routing based on the CT0-CT4 criteria).
Prioritize studies with COVID-19 features in a worklist.
Perform a rapid and qualitative assessment of abnormal changes by comparing several studies.
Minimize the risk of errors and missed anomalies.

Currently, there is a wide range of publicly available COVID-19 data sets [2, 5]. However, this should not be seen as an obstacle, since the development of AI algorithms requires large amounts of high-quality clinical information that is representative of real patient populations. In addition, AI algorithms should be tested using new data sets that were not used in the training and calibration stages. The more data available in open sources, the better for developers. The available data sets are relatively small and rarely contain additional information such as tags and/or binary masks for regions of interest (ROI).

HOW TO USE THE DATASET

Permanent link: https://mosmed.ai/datasets/covid19_1110. This data set is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0) license.
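
Since the studies and masks are distributed as Gzip-compressed NIfTI files, a first sanity check after download might be to read the image dimensions and voxel sizes from the file header. The sketch below does this with the Python standard library only, using the fixed field offsets of the NIfTI-1 header (sizeof_hdr at byte 0, dim at byte 40, pixdim at byte 76). The function name is hypothetical, the sketch assumes NIfTI-1 rather than NIfTI-2 files, and it does not read the voxel data; in practice a dedicated library such as nibabel would normally be used for that.

```python
import gzip
import struct


def read_nifti1_shape(path):
    """Return (shape, voxel_sizes) from the header of a gzipped NIfTI-1 file."""
    with gzip.open(path, "rb") as f:
        header = f.read(348)  # a NIfTI-1 header is 348 bytes long
    if len(header) < 348:
        raise ValueError("file too short for a NIfTI-1 header")

    # sizeof_hdr must equal 348; the byte order that yields 348 is the right one.
    for endian in ("<", ">"):
        if struct.unpack(endian + "i", header[0:4])[0] == 348:
            break
    else:
        raise ValueError("not a NIfTI-1 header")

    dim = struct.unpack(endian + "8h", header[40:56])      # dim[0] = number of axes
    pixdim = struct.unpack(endian + "8f", header[76:108])  # pixdim[1..] = voxel sizes
    ndim = dim[0]
    if not 1 <= ndim <= 7:
        raise ValueError("invalid number of dimensions in header")
    return dim[1:1 + ndim], pixdim[1:1 + ndim]
```

For a downloaded file, a call such as `read_nifti1_shape("study.nii.gz")` (the file name here is a placeholder, not a path from the dataset) would return the matrix size and voxel spacing, where the slice spacing would be expected to reflect the every-10th-slice export described above.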

ADDITIONAL INFO

Funding. The study had no sponsorship.

Conflict of interest. The authors declare no conflict of interest regarding this publication.

Authors' contribution. S.P. Morozov: concept of research; A.E. Andreychenko: study design, data set formation; I.A. Blokhin: data markup, manuscript editing; P.B. Gelezhe: search for publications on the topic of the article, data markup; A.P. Gonchar: data markup, expert assessment of information; A.E. Nikolaev: data markup, expert assessment of information; N.A. Pavlov, V.Yu. Chernina, V.A. Gombolevskiy: manuscript writing, preparing the dataset. All authors made a significant contribution to the study and preparation of the article, and read and approved the final version before publication.

Acknowledgements. The authors express their gratitude to all doctors of the Moscow Health Department who are fighting the epidemic.