Diagnostic Performance of a Deep Learning Model Deployed at a National COVID-19 Screening Facility for Detection of Pneumonia on Frontal Chest Radiographs

(1) Background: Chest radiographs are the mainstay of initial radiological investigation in this COVID-19 pandemic. A reliable and readily deployable artificial intelligence (AI) algorithm that detects pneumonia in COVID-19 suspects can be useful for screening or triage in a hospital setting. This study has a few objectives: first, to develop a model that accurately detects pneumonia in COVID-19 suspects; second, to assess its performance in a real-world clinical setting; and third, by integrating the model with the daily clinical workflow, to measure its impact on report turn-around time. (2) Methods: The model was developed from the NIH Chest-14 open-source dataset and fine-tuned using an internal dataset comprising more than 4000 CXRs acquired in our institution. Input from two senior radiologists provided the reference standard. The model was integrated into daily clinical workflow, prioritising abnormal CXRs for expedited reporting. Area under the receiver operating characteristic curve (AUC), F1 score, sensitivity, and specificity were calculated to characterise diagnostic performance. The average time taken by radiologists in reporting the CXRs was compared against the mean baseline time taken prior to implementation of the AI model. (3) Results: 9431 unique CXRs were included in the datasets, of which 1232 were ground truth-labelled positive for pneumonia. On the “live” dataset, the model achieved an AUC of 0.95 (95% confidence interval (CI): 0.92, 0.96) corresponding to a specificity of 97% (95% CI: 0.97, 0.98) and sensitivity of 79% (95% CI: 0.72, 0.84). No statistically significant degradation of diagnostic performance was encountered during clinical deployment, and report turn-around time was reduced by 22%. (4) Conclusion: In real-world clinical deployment, our model expedites reporting of pneumonia in COVID-19 suspects while preserving diagnostic performance without significant model drift.


Introduction
An outbreak caused by the SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) was first identified in Wuhan, China in December 2019 [1]. The World Health Organisation first established COVID-19 as a global pandemic on 11 March 2020 [2]. Singapore is an independent city-state with one of the highest testing [3] and vaccination rates in the world [4], with the National Centre for Infectious Diseases (NCID) at the heart of the nation's response to COVID-19. At the time of writing, Singapore has seen more than 280,000 confirmed cases with 835 COVID-19-related deaths [5].
Although reverse transcription polymerase chain reaction (RT-PCR) is the gold standard for diagnosing COVID-19, the process can be time consuming and sometimes results in delay in diagnosis. Computed tomography (CT) findings have been shown to correlate well with RT-PCR result [6], and since the pandemic began, deep learning algorithms have been developed to diagnose COVID-19 through CT scans [7][8][9] or through predictive models built around laboratory results [10]. However, the American College of Radiology (ACR) cautioned against using CT as a first-line test to diagnose COVID-19 given the non-specific imaging findings of the disease and the fact that a normal chest CT does not preclude COVID-19 [11]. Thus, most ambulatory care facilities still rely on the chest radiograph as the mainstay of initial radiological investigation in COVID-19 screening workflows, as these are more easily acquired, and X-ray machines are logistically more amenable to infection control measures.
The clinical value of CXRs lies less in the diagnosis of COVID-19 than in the identification of imaging signs of pneumonia and the assessment of severity for clinical decision making, and more groups now utilise CXRs to analyse the severity of disease and to predict clinical outcome [12][13][14]. Published recommendations of the Fleischner Society [15] and the European Society for Thoracic Imaging are in line with this approach [16]. Several groups have applied deep learning methods for detection of COVID-19 pneumonia on CXRs, often with accuracies of over 90% [17][18][19][20][21][22][23]. However, most of these models were tested in simulated environments, and their diagnostic performance in the real-world setting has not been validated. It is known that the diagnostic performance of algorithms can degrade significantly when deployed in clinical practice [24]. Furthermore, the clinical impact of these image-based deep learning models is not often measured in the same setting.
At the height of the first wave of the pandemic in 2020, our team developed a deep learning model trained with CXRs acquired in a specialised COVID-19 national screening facility and assessed its ability to detect pneumonia in COVID-19 suspects. To take it a few steps further, we assessed and validated the model's diagnostic performance in a real-world setting before deploying and integrating the model into our daily clinical workflow. In this study, we measured its performance on a "live" dataset and further evaluated its impact on the turn-around time.

Materials and Methods
This study was approved by the institutional review boards of the respective institutions and was compliant with the Health Insurance Portability and Accountability Act (HIPAA). A waiver of consent was granted due to the retrospective nature of the study and the minimal risks involved.

Datasets
Three separate datasets were included in this study: the training set, the proof-of-concept (POC) offsite test set, and the "live" clinical deployment set. All CXRs were acquired in the same hospital and re-sized to a resolution of 224 × 224 pixels. Two senior radiologists (each with more than a decade of experience) provided the reference standard. Any disagreement was resolved through discussion of perceived radiological findings and through consultation with two other senior radiologists not directly involved in the study. The team chose to adopt radiologists' inputs as the reference standard (instead of PCR results), as the goal of this project was to create a tool to carry out high-throughput screening to triage COVID-19 suspects with radiologically proven pneumonia; as such, approximating the model's performance to that of senior radiologists was deemed more relevant. Furthermore, we know that most patients with COVID-19 do not develop pneumonia, especially earlier on in the disease [25]. Therefore, in the following datasets, the "pneumonia-negative" groups may contain patients who tested positive on the SARS-CoV-2 RT-PCR test, given that the ground-truth labels are radiological observations. The converse is also true.

Training Set
The training dataset comprised frontal CXRs acquired in NCID and NCID's partner hospital. A total of 4277 radiographs were included; 971 of these were ground truth-labelled positive for pneumonia. Out of these pneumonia-positive radiographs, 465 were culture-confirmed bacterial pneumonia or non-COVID-related viral pneumoniae, while the other 506 were RT-PCR confirmed COVID-19 cases. Cases that were labelled negative for pneumonia constituted the negative group even if (1) there were other abnormalities noted on the chest radiograph (e.g., pneumothorax, atelectasis, etc.), or (2) they tested positive on the COVID-19 RT-PCR test. One-fifth of the training set was held out as the validation set.

Proof-of-Concept (POC) Offsite Test Set
A total of 1440 frontal CXRs acquired in NCID were used as an offsite test dataset. Seventy-two of these radiographs were ground truth-labelled positive for pneumonia. With regard to demographics, 84% of these patients were male (n = 1209), while 16% were female (n = 231), with an average age of 35.8 years.

Clinical Deployment Set
The clinical deployment set included 3714 frontal CXRs acquired in NCID, with 189 ground truth-labelled as positive for pneumonia. In terms of demographics, 87% of these patients were male (n = 3241), while 13% were female (n = 473). The average age of the patient population for this dataset was 37. Both the offsite test and the clinical deployment datasets were predominantly made up of young, adult males because Singapore was experiencing a surge in cases in the foreign worker dormitories at the time the deployment was carried out.
All patients in the latter two datasets were suspected of having COVID-19 at the time the CXR was acquired. A large majority of the CXRs in these two datasets were notably unremarkable, as we exclusively utilised CXRs that were acquired in NCID, which did not attend to patients with non-COVID-related complaints.
While the reference standard we adopted was input from senior radiologists, we collected the SARS-CoV-2 RT-PCR results from all COVID-suspect patients involved in the study. This is presented in Appendix A.

Development of the Deep Learning Model
To rapidly deploy a deep learning model for COVID-19 screening, we constructed and tested an ensemble model with the training dataset. We leveraged transfer learning based on existing trained networks in view of the relatively urgent need for increased efficiency during a rapidly evolving global pandemic. Several groups have also tapped into transfer learning to create models for detecting COVID-19 on X-rays, CT, and even ultrasound [26,27].

Transfer Learning on Deep Neural Networks
Transfer learning is a technique in which a model developed for one predictive modelling problem is reused, partly or wholly, on a different but related problem to accelerate training and improve the performance of a machine learning model [28]. The most commonly used transfer learning techniques for deep neural networks are pre-training based methods [29]. We utilised the weight initialisation-based method, a common technique used in medical image analysis tasks, which reuses the weights trained on a source dataset as a starting point for the target dataset. In this manner, all the weights are adjusted for the target task in the fine-tuning step.
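The weight initialisation-based method can be illustrated with a minimal sketch. It is not the study's code: plain dictionaries and lists stand in for model weights, the layer names and the function `init_from_pretrained` are hypothetical, and the choice to skip the final classification layer is an assumption made because the source and target tasks have different numbers of outputs.

```python
def init_from_pretrained(pretrained, target, skip_prefix="classifier."):
    """Return target weights initialised from pretrained ones, except the final layer."""
    initialised = dict(target)  # copy so the caller's dictionary is not modified
    for name, weights in pretrained.items():
        if name.startswith(skip_prefix):
            continue  # the source-task output layer does not fit the target task
        if name in initialised:
            initialised[name] = weights
    return initialised


# Toy "state dictionaries": lists stand in for weight tensors (hypothetical values).
pretrained = {"features.conv0.weight": [0.1, 0.2], "classifier.weight": [0.5] * 14}
target = {"features.conv0.weight": [0.0, 0.0], "classifier.weight": [0.0] * 2}

new_state = init_from_pretrained(pretrained, target)
```

After this step every layer, including the freshly initialised output layer, would still be updated during fine-tuning on the target dataset.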

Network Architectures
We employed the Dense CNN (DenseNet [30]) as the backbone model to distinguish CXRs. The workflow of pre-training, initializing, and fine-tuning processes is shown in Figure 1. In pre-training (Step 1), the networks are first trained on the publicly available ChestX-ray14 dataset [31] to obtain some common features. In Step 2, the pre-trained weights except those of the last layer are used to initialise the DenseNet121 model for the hospital's dataset. For the final step, we fine-tuned all the layers in the DenseNet121 model using our hospital's dataset. To exploit data distribution and representation, our team created several different models for the ensemble using transfer learning, focal loss, weighted cross-entropy loss, and model adaptation. We expected data imbalance between the two classes (negative or positive for pneumonia), as most of the screening CXRs were likely to be normal. To tackle this issue of data imbalance, we utilised data augmentation through multiple sampling of the data and weighted loss to ensure representative results. The technical details of the network architecture are elaborated upon in Appendix B.
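The two imbalance-aware losses named above can be sketched for a single prediction as follows. This is an illustrative sketch only: the function names are hypothetical, and the values `gamma=2.0`, `alpha=0.25`, and `pos_weight` are assumptions, since the settings used in the study are not stated here.

```python
import math

EPS = 1e-12  # guards against log(0)


def weighted_bce(p, y, pos_weight=1.0):
    """Binary cross-entropy with the positive class up-weighted by pos_weight."""
    if y == 1:
        return -pos_weight * math.log(p + EPS)
    return -math.log(1.0 - p + EPS)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: shrinks the contribution of examples already classified well."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t + EPS)


# A confident correct positive contributes far less than a confident miss.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

With `gamma=0` the focal term disappears and the loss reduces to a class-weighted cross-entropy, which is why the two losses are often treated as related remedies for class imbalance.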

Figure 1. Workflow of pre-training, initializing, and fine-tuning processes. C1 is a multi-label classifier with 14 elements, and C2 is a classifier containing two neurons.

Deployment of Model Ensemble
There are several approaches and flowcharts in diagnosing and ruling out COVID-19, and CXRs have been widely used as an integral part of the triage process [15,32]. NCID screening centre utilizes a similar workflow, depicted in Figure 2. The clinicians and radiologists further agreed upon a 1-h turn-around time (TAT) for interpretation of all CXRs done in NCID to facilitate patient flow and to minimize chokepoints [33]. The team calculated the TAT by extracting timestamps from the hospital's Centricity Radiology Information System (RIS) and computing the time elapsed between the completion of the CXR and final approval of the CXR report.
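The TAT calculation described above can be sketched as follows. The timestamps, the timestamp format, and the function name `turnaround_minutes` are hypothetical; the actual RIS export format is not described in the text.

```python
from datetime import datetime
from statistics import mean


def turnaround_minutes(completed, approved, fmt="%Y-%m-%d %H:%M"):
    """Minutes elapsed between CXR completion and final report approval."""
    delta = datetime.strptime(approved, fmt) - datetime.strptime(completed, fmt)
    return delta.total_seconds() / 60.0


# Hypothetical (completion, approval) timestamp pairs, not study data.
records = [
    ("2020-05-01 09:00", "2020-05-01 09:12"),
    ("2020-05-01 09:05", "2020-05-01 09:45"),
    ("2020-05-01 10:00", "2020-05-01 11:10"),
]

tats = [turnaround_minutes(c, a) for c, a in records]
mean_tat = mean(tats)
# Share of studies reported within the agreed 1-h target.
within_target = sum(t <= 60 for t in tats) / len(tats)
```

Comparing `mean_tat` for a window before deployment with the same quantity after deployment gives the relative change in TAT.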
The hospital uses the Centricity Radiology Information System (RIS) and Picture Archiving and Communication System (PACS) from GE Healthcare. A mini-PACS, RA600 (GE Healthcare, Chicago, IL, USA), was installed in a workstation serving as a DICOM listener to temporarily store DICOM files independently from the hospital's PACS system.
The X-ray modality used in NCID is the FDR Visionary Suite (Fujifilm Healthcare, Tokyo, Japan). These machines were configured to send the CXRs to the hospital's PACS and the RA600. Only frontal CXRs (anteroposterior and posteroanterior) were sent to the RA600; other projections were excluded. The RA600 receives and stores incoming DICOM files sent to the workstation. The local DICOM store is exported as a shared folder and mounted in the virtual machine (VM) via the Server Message Block (SMB) protocol. From here, the model retrieves and interprets each CXR and assigns a binary value, either "0" for absence of pneumonia or "1" for the presence of pneumonia. The radiograph is then deleted from the workstation after processing to minimise the risk of unauthorised access to patients' identifiers. With regard to computational cost, running all the models on one image takes less than 3 s on a CPU workstation (Intel Xeon Gold 6242, 2.8 GHz, 256 GB RAM). In batch processing using a GPU, it takes less than 0.01 s per image per model.
Finally, the results were retrieved via SSH File Transfer Protocol (SFTP) and updated automatically into the RIS system and matched by each case's accession number as the unique identifier. Cases that were flagged as positive by the algorithm were prioritized to the top of the radiologist worklist chronologically. This was done using a Health Level Seven (HL7) compliant script developed by the hospital's RIS/PACS vendor. A diagrammatic illustration of the deployment process is shown in Figure 3. Of note, this method merely provided the prediction of the model to the reporting radiologists; the responsibility of making the final diagnosis still rested with the radiologists.
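The prioritisation rule can be illustrated with a short sketch. It is not the vendor's HL7 script: the record fields (`accession`, `acquired`, `ai_flag`) and the function `prioritise` are hypothetical, and the sketch assumes ISO-formatted timestamps so that string order matches chronological order.

```python
def prioritise(worklist):
    """Flagged studies first; within each group, earliest acquisition first."""
    return sorted(worklist, key=lambda s: (-s["ai_flag"], s["acquired"]))


# Hypothetical worklist entries keyed by accession number.
worklist = [
    {"accession": "A001", "acquired": "2020-05-01T09:00", "ai_flag": 0},
    {"accession": "A002", "acquired": "2020-05-01T09:05", "ai_flag": 1},
    {"accession": "A003", "acquired": "2020-05-01T09:10", "ai_flag": 0},
    {"accession": "A004", "acquired": "2020-05-01T09:02", "ai_flag": 1},
]

ordered = [s["accession"] for s in prioritise(worklist)]
# ordered == ["A004", "A002", "A001", "A003"]
```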
To ensure our infrastructure complied with the hospital's security standards, official approval was obtained from Singapore's Integrated Health Information Systems (IHiS) committee overseeing the hospital's IT risk and security.

Proof of Concept (POC)-Offsite Test
Once our team had the model and infrastructure in place, we conducted an offsite test over the course of a week to ensure satisfactory model performance before officially incorporating the model into the daily workflow. The results of this offsite test are presented in a later segment.

Statistical Analysis
As the networks were trained to output class probabilities, using binary cross-entropy or focal loss, each CXR could be classified as pneumonia or non-pneumonia. Varying the threshold applied to the network output yields different sensitivities and specificities, which together form the ROC curve on any testing dataset. The performance of the algorithm was expressed in AUC, F1 score, sensitivity, specificity, and accuracy. The 95% confidence intervals (CIs) were computed in MATLAB R2014b using the perfcurve and bootci functions with the bias-corrected and accelerated percentile method [34].
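As an illustration of the analysis, the sketch below computes the AUC and a bootstrap confidence interval. It departs from the study in two labelled ways: it is written in Python rather than MATLAB, and it uses a plain percentile bootstrap instead of the bias-corrected and accelerated method. The labels, scores, and function names are hypothetical.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the AUC (simpler than the BCa method)."""
    rng = random.Random(seed)
    idx = range(len(labels))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]  # resample cases with replacement
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # AUC needs both classes present
            stats.append(roc_auc(ys, [scores[i] for i in sample]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Hypothetical labels (1 = pneumonia) and model scores, not study data.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]

auc = roc_auc(labels, scores)  # 15 of 16 positive-negative pairs ranked correctly
lo, hi = bootstrap_ci(labels, scores, n_boot=500)
```

The pairwise definition used in `roc_auc` gives the same value as the area under the ROC curve traced by sweeping the threshold.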

Results from Proof of Concept-Offsite Test Set
We validated our trained model with different methods and network architectures. Comparing each individual network architecture, the transfer-learned DenseNet121 achieved the best results. Our ensemble of seven models further enhanced the algorithm's performance; this is presented in Table 1. We compared our ensemble algorithm with existing, published deep learning algorithms using an offsite test dataset as a proof-of-concept (POC). This POC dataset included a total of 1440 CXRs, of which 72 were labelled positive for pneumonia. Our ensemble algorithm achieved the highest AUC (AUC = 0.9369) with a maximum F1 of 0.9120. The results of this POC test are depicted in Table 2, with the ROC curves displayed in Figure 4. Incidentally, the patch-based method [35] shows better results when the specificity is low and sensitivity is high, and this is better shown in Figure 5.

Results from Clinical Deployment
A total of 3714 unique CXRs were included in the deployment set, with 189 studies labelled as positive for pneumonia. On this "live" dataset, our algorithm achieved an AUC of 0.9456 (95% CI: 0.9181, 0.9627) and a maximum F1 of 0.9118. The ROC curve is illustrated in Figure 6. Applying a decision threshold to the model output yields the confusion matrix shown in Table 3.
At the height of the pandemic in Singapore, our team saw a dire need to optimise hospital resources. As such, we opted for a model that did not "over-diagnose" pneumonia, as that may cloud the opinion of the reporting radiologists and result in unnecessary use of scarce healthcare resources. Therefore, we set a threshold that resulted in a higher negative predictive value (NPV).
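For reference, the quantities that a chosen threshold fixes can be derived from the four cells of a confusion matrix, as in the sketch below. The counts are hypothetical and are not the values in Table 3; the function name `screening_metrics` is likewise illustrative.

```python
def screening_metrics(tp, fp, fn, tn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # positives correctly flagged
        "specificity": tn / (tn + fp),   # negatives correctly cleared
        "ppv": tp / (tp + fp),           # flagged cases that are truly positive
        "npv": tn / (tn + fn),           # cleared cases that are truly negative
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts for 1000 screening radiographs, not study data.
m = screening_metrics(tp=80, fp=30, fn=20, tn=870)
```

Raising the threshold generally trades sensitivity for specificity, so the operating point has to be chosen against the clinical priorities of the deployment.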

Turnaround Time
Turn-around time (TAT) was calculated using the data extracted from the hospital's RIS. We measured the average TAT within a three-week window before and after the deployment of our deep learning model. Radiologists took an average of seven minutes to

Discussion
There have been growing efforts from researchers to develop an efficient and reliable AI solution to help diagnose patients with COVID-19. Most published studies related to our subject of interest attained good results but did not go on to test their models in a real-world clinical setting [17][18][19][20][21][22][23]35]. Al-Waisy et al. fused results obtained from two different deep learning methods to achieve accuracy rates of 99.93% [18]. Nayak et al. conducted a comprehensive review of eight pre-trained CNN models and found ResNet-34 to be the best performing model at 98.33% [19]. This study's model boasted an AUC of 0.95, with specificity of 97.1% and moderate sensitivity of 78.8%. The diagnostic performance was not degraded when deployed in the clinical setting, something which the prior studies have not demonstrated.
As a proof of value, our group wanted to establish a model that could carry out high-throughput screening for COVID-19 pneumonia in large numbers of suspected COVID-19 patients, quickly triaging patients with radiographic findings on CXR before an RT-PCR diagnosis in order to stratify patients for high-risk management and optimize hospital resources. This is especially relevant with the emergence of the Delta and, more recently, the Omicron variants, as healthcare institutions around the world struggle to allocate scarce healthcare resources [39,40]. This study evaluated the diagnostic performance of our model on a "live" dataset and demonstrated that it is possible to preserve diagnostic performance of a deep learning algorithm when transferred to clinical deployment. This is likely due to an adequate dataset, unbiased in selection and curation, that was used for fine-tuning a base algorithm.
In this study, the deep learning model performed well utilising radiologists' input as reference standard. We chose to adopt senior radiologists' inputs as reference standard, as the team envisioned a tool that could help prioritise patients with definite radiological findings and thus alert the clinicians to these patients who are potentially more vulnerable to clinical deterioration [41] and should hopefully receive earlier intervention. Therefore, approximating the model's performance to that of senior radiologists was deemed more relevant. While the specific features of COVID-19 on CT have been described extensively [42,43], the features of COVID-19 on CXRs are more ambiguous. That said, a frontal CXR remains the mainstay of initial radiological screening in most institutions in this pandemic, making our model relevant for widespread adoption. In our clinical deployment dataset, 607 patients had a positive RT-PCR result, and 48 of these patients had a CXR that was flagged as abnormal by the deep learning algorithm. We postulate the model fared worse than those that have been published [18,19,44,45] because it was deployed in Singapore's national screening facility where a large majority of the patients were asymptomatic or oligosymptomatic on presentation, with more subtle radiological findings, thus posing a greater challenge to the model. This is largely secondary to the active tracing and aggressive testing policy that Singapore pursues.
In our institution, clinicians and radiologists have agreed upon a 1-h turn-around time (TAT) from the acquisition of a CXR image to the completion of a radiological report. Even though there was a modest 22% improvement in TAT during the deployment period, we believe that the value of the model in increasing radiologist efficiency has not been fully justified given the relatively low number of positive cases at our institution during the clinical deployment phase. As a triaging tool, we believe that the model can provide greater reduction in TAT in healthcare facilities that face large surges in demand for radiologist reporting.
The exact impact of the deep learning model on the eventual accuracy of the radiologists is beyond the scope of this study although work is underway in our institution to investigate this aspect further. Several studies have shown that their AI system can identify characteristics of COVID-19 on chest radiographs with performance comparable to experienced radiologists [44,46,47]. Harrison et al. further proved that a well-built AI system improved radiologists' performance in distinguishing COVID-19 pneumonia from non-COVID-19 pneumonia at chest CT [48]. In addition, the team attempted to employ saliency maps to detect features characteristic of COVID-19 on chest radiographs. The saliency maps indicate the regions of each radiograph that had the greatest influence on the model's prediction. We realised that while some of these maps accurately highlight diseased lung fields (See Figure A2 in Appendix C), their presence can be confounding in false-positive, false-negative, or even true-negative cases (See Figures A3 and A4 in Appendix C). Localisation of the pathology is not the main desired output of our network, but the utility of saliency maps in localisation models (e.g., detecting lung nodules on CT) warrants additional scrutiny. Zandehshahvar et al. reported a deep learning approach to analyse severity of COVID-19 [12] while Dayan et al. included CXRs as a key component of their clinical outcome predictive model [13]. In future advancements, a radiographic severity score can potentially be integrated into our algorithm with relevant outputs that can alert the clinicians to patients who are more vulnerable to clinical deterioration and therefore intervene earlier.
Even though this was a prospective clinical deployment, there remain several limitations to our study. Firstly, the development, validation, deployment, and testing of the algorithm were all done within a single institution. While working within a single institution hastened the relevant processes, cross-institution deployment would be needed to ensure replicability of our results and reliability of our model. However, the degree of variation is likely to be low given that our COVID-19 screening centre is the largest national facility, and it receives referrals from across our country. Secondly, the most frequently observed distribution patterns of COVID-19 include bilateral involvement, peripheral distribution, consolidations, and ground-glass opacification, whereas pleural effusions are rare [49,50]. Many of the features are also seen in bacterial and other viral pneumoniae; hence, it is not certain if our model can reliably differentiate patients with community acquired pneumonias from those of COVID-19. Thirdly, our algorithm is purely based on computer vision even though we know that in clinical practice, history of exposure, patient's symptoms, and laboratory results are important factors considered as part of the clinical diagnostic workup. Finally, our model was trained, validated, and tested on adult CXRs. As such, these results cannot be extrapolated to the paediatric population. Fortunately, only a small minority of the known COVID-19 cases are in children, and most of the confirmed cases in the paediatric population have relatively milder symptoms [51] and likely minimal CXR findings.

Conclusions
In conclusion, even as the global pandemic of COVID-19 evolves, chest radiography remains a valuable tool in the screening and severity assessment of disease. A reliable and readily deployable AI algorithm can expedite clinical decision making. Our team developed a deep learning algorithm that performs well in a simulated environment and preserves its diagnostic performance when tested with a "live" dataset in real-world clinical deployment. The algorithm was integrated into the clinical workflow and successfully reduced report turn-around time by prioritising abnormal cases.
Future work can incorporate non-image-related data into a single model to further improve performance and could involve cross-institution deployment to ensure replicability and reliability. Our team also aims to measure improvement in radiologists' performance (if any) following augmentation with our deep learning model.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix B. Technical Details of the Network Architecture and Performance Matrix
In brief, DenseNets are densely connected neural networks composed of Dense Blocks and transition layers; the transition layers contain convolutional layers and a max-pooling layer, and connect each sequential pair of blocks. Each Dense Block contains a varying number of paired [1 × 1] and [3 × 3] convolutional layers, each preceded by Batch Normalization (BN) and a rectified linear unit (ReLU). The original design of DenseNet121 has a global average pooling layer followed by a 1000-node fully connected (FC) layer for feature mapping.
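To make the block structure concrete, the following minimal sketch tallies the weighted layers and the channel count of a DenseNet121. The block sizes (6, 12, 24, 16), the growth rate of 32, and the 64-channel stem are the standard DenseNet-121 configuration and are an assumption here, since this appendix does not state them; the function name `densenet_summary` is purely illustrative.

```python
# Standard DenseNet-121 configuration (assumed from the original DenseNet
# design; these values are not stated in this appendix).
BLOCK_CONFIG = (6, 12, 24, 16)   # paired-conv groups per Dense Block
GROWTH_RATE = 32                 # channels added by each group
INIT_FEATURES = 64               # channels after the stem convolution


def densenet_summary(block_config=BLOCK_CONFIG, growth_rate=GROWTH_RATE,
                     init_features=INIT_FEATURES):
    """Return (weighted layer count, channels entering global pooling)."""
    channels = init_features
    layers = 1  # stem convolution
    for index, groups in enumerate(block_config):
        layers += 2 * groups              # one 1x1 and one 3x3 conv per group
        channels += groups * growth_rate  # dense concatenation grows channels
        if index < len(block_config) - 1:
            layers += 1                   # 1x1 conv in the transition layer
            channels //= 2                # transition halves the channel count
    layers += 1  # final fully connected classifier
    return layers, channels


layer_count, feature_channels = densenet_summary()
print(layer_count, feature_channels)
```

Under these assumptions the count comes to 121 weighted layers, and the global average pooling layer receives 1024 feature channels, which is the vector the fully connected classifier operates on.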

Pre-Training and Fine-Tuning
In the pre-training phase (Step 1), the model is first trained on a large-scale dataset (ChestXray14), which encompasses 14 common thoracic diseases, including pneumonia, cardiomegaly, and pneumothorax [31]. We used a 4:1 training/validation split on the NIH dataset. We optimised the Binary Cross Entropy (BCE) with Logits Loss, which combines a sigmoid layer and the BCE loss for each class.
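As an illustration of how a sigmoid layer and the BCE loss can be combined into one per-class loss, the sketch below evaluates the loss directly from raw logits using a numerically stable identity. It is a plain-Python sketch and not the training code used in this study; the function name `bce_with_logits` and the choice to average over the classes are assumptions.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross entropy over classes, computed from raw logits.

    Uses max(x, 0) - x*y + log(1 + exp(-|x|)), which equals
    -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))] but avoids overflow
    for logits of large magnitude.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Hypothetical multi-label example: one logit and one 0/1 target per finding.
example_logits = [2.0, -1.5, 0.0]
example_targets = [1.0, 0.0, 1.0]
print(bce_with_logits(example_logits, example_targets))
```

Because each class has its own sigmoid, a radiograph can be positive for several of the 14 findings at once, which suits the multi-label pre-training task.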
In the fine-tuning phase (Step 3), we optimised the Weighted Cross Entropy Loss (WCEL) for the classifier (C2) on the target dataset. For a single sample $x_i$ in our TTSH dataset, we used $y_i$ to denote the corresponding label, with $y_i = [0, 1]^T$ or $y_i = [1, 0]^T$. The WCEL is defined in Equation (A1), where $u$ is a manual rescaling weight given to the positive class; in this work, we set $u$ to the ratio of the number of negative training samples to the number of positive training samples. $\hat{y}_i$ denotes the network's prediction for the $i$th input image $x_i$ from the target domain. The network is trained end-to-end using Adam [52] with standard parameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$). The batch size is 32. We used an initial learning rate of 0.0001 and selected the model with the highest area under the receiver operating characteristic (ROC) curve (AUC) on the validation set.
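The weighting scheme described above can be sketched as follows. This is a minimal illustration that assumes the common form of a positive-weighted cross entropy, with a scalar label (1 for pneumonia, 0 otherwise) in place of the one-hot vector and a mean over samples; the exact expression used in this study may differ, and the function names are hypothetical.

```python
import math


def positive_class_weight(labels):
    """u = (# negative samples) / (# positive samples), as described above."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return negatives / positives


def weighted_cross_entropy(probs, labels, u, eps=1e-12):
    """Mean cross entropy with the positive-class term rescaled by u.

    Assumed form: -(u * y * log(p) + (1 - y) * log(1 - p)) per sample.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += -(u * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Hypothetical imbalanced batch: one positive CXR among four.
labels = [1, 0, 0, 0]
u = positive_class_weight(labels)
print(u, weighted_cross_entropy([0.5, 0.5, 0.5, 0.5], labels, u))
```

With this choice of u, the single positive sample contributes as much to the loss as the three negatives combined, which counteracts the scarcity of positive radiographs in a screening population.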
To diversify the model representation, we extended the CheXnet14 network (DenseNet121, a 121-layer network) with two fully connected (FC) layers at the classification end (C2 in Figure 1), with 1024 nodes and 128 nodes, respectively, followed by a sigmoid node for classification, and experimented with different loss functions at the network tuning stage. In previous renditions of DenseNet121, Wang et al. adopted a weighted cross entropy loss [31], while Rajpurkar et al. utilised an unweighted binary cross entropy loss for network training [53]. To enhance the model's capability, we replaced the loss function for the DenseNet121 with a Focal Loss (FL) [54], which is designed to handle a large imbalance between classes. We also carried out data augmentation through multiple sampling to better represent the random distribution of the available dataset. We added an experiment to compare the results before and after data augmentation and weighted loss and found that the latter effectively improved the classification results. This is especially relevant in national screening centres, where most of the screening CXRs are expected to be normal. We also increased the number of FC layers after feature extraction to give better distinguishing power for the classification of CXRs with features suggestive of COVID-19. The FL is defined by

$$\mathrm{FL}(p_t) = -\alpha \, (1 - p_t)^{\gamma} \log(p_t), \qquad p_t = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{if } y = 0 \end{cases} \tag{A2}$$

where $y$ is the ground truth of the class, and $p$ is the estimated output of the network for an input CXR image. We adopted Pydicom [55] for image pre-processing. For intensity normalisation, we set the mean of each CXR image to 0 and its standard deviation to 1. The Focal Loss parameters are set to α = 0.25 and γ = 2. Thus far, we utilised model adaptation and WCEL (Equation (A1), Model A) and transfer learning and FL (Equation (A2), Model B) to refine two models using the TTSH training dataset.
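The effect of the Focal Loss parameters can be checked with a short sketch. It assumes the standard form of the loss from [54], namely -α (1 - p_t)^γ log(p_t), with the values α = 0.25 and γ = 2 quoted above; using one constant α for both classes is a simplification, and the function name `focal_loss` is illustrative rather than taken from the study's code.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one prediction p (probability of class 1), label y."""
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    p_t = min(max(p_t, eps), 1.0)    # guard against log(0)
    # (1 - p_t)**gamma shrinks the loss of well-classified samples.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy normal CXR (p = 0.05, y = 0) versus a missed pneumonia (p = 0.05, y = 1).
print(focal_loss(0.05, 0), focal_loss(0.05, 1))
```

The modulating factor means that the many confidently classified normal radiographs contribute very little, so the gradient is dominated by the harder and rarer positive cases.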
To exploit the data distribution and representation, we built another five models for a data ensemble, using Model B as the backbone model because it had better AUC and F1 performance.
The five models were trained using five sub-datasets randomly sampled from the TTSH training data. Experiments showed that with the model ensemble and data ensemble, the overall AUC and F1 outperformed any single model (Model A or Model B) or a combination of the two. In terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), we can further define the performance matrix.
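The sketch below shows one way to combine several model outputs and derive the usual metrics from TP, TN, FP, and FN. Averaging the probabilities and thresholding at 0.5 are assumptions made for illustration, since the ensembling rule and operating point are not specified here; the metric formulas are the standard definitions of sensitivity, specificity, precision, and F1, and all function names are hypothetical.

```python
def ensemble_probability(model_probs):
    """Average the probabilities that several models assign to one CXR."""
    return sum(model_probs) / len(model_probs)


def confusion_counts(probs, labels, threshold=0.5):
    """Count TP, TN, FP, and FN at a given probability threshold."""
    tp = tn = fp = fn = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        elif predicted == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def performance_metrics(tp, tn, fp, fn):
    """Standard definitions in terms of TP, TN, FP, and FN."""
    sensitivity = _ratio(tp, tp + fn)   # recall for the pneumonia class
    specificity = _ratio(tn, tn + fp)
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}


# Hypothetical outputs of five models for three CXRs, with reference labels.
per_cxr_outputs = [[0.9, 0.8, 0.7, 0.9, 0.6], [0.1, 0.2, 0.1, 0.3, 0.2],
                   [0.4, 0.6, 0.3, 0.2, 0.4]]
reference_labels = [1, 0, 1]
scores = [ensemble_probability(outputs) for outputs in per_cxr_outputs]
print(performance_metrics(*confusion_counts(scores, reference_labels)))
```

F1 is the harmonic mean of precision and sensitivity, so it is less flattering than accuracy when normal radiographs greatly outnumber abnormal ones.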

Appendix C. Examples Using Saliency Maps and Probability Output
The first image of each example shows the annotation by our radiologists, the second image demonstrates the saliency map generated for that image by our algorithm, and the last image shows the probability score of the algorithm's diagnosis. Of note, the red areas in the saliency maps display areas of interest and may not necessarily focus on the area of abnormality.
Figure A2. Selected true-positive example. The algorithm correctly diagnosed pneumonia in this case, where bilateral, multifocal airspace opacities are seen.
Figure A3. Selected false-positive example. This example demonstrates a potential pitfall in CXRs with prominent breast shadows.
Figure A4. Selected false-negative example. The airspace changes in this case are relatively mild, posing a greater challenge to the algorithm.