Deep learning for AI-based diagnosis of skin-related neglected tropical diseases: a pilot study

Background
Deep learning, which is part of the broader field of artificial intelligence (AI) and machine learning, has achieved remarkable success in vision tasks. While there is growing interest in using this technology for diagnostic support for skin-related neglected tropical diseases (skin NTDs), studies in this area have been limited, and fewer still have focused on dark skin. In this study, we aimed to develop deep learning based AI models using clinical images we collected for five skin NTDs, namely Buruli ulcer, leprosy, mycetoma, scabies, and yaws, and to understand how diagnostic accuracy can or cannot be improved using different models and training patterns.

Methodology
This study used photographs collected prospectively in Côte d’Ivoire and Ghana through our ongoing studies, which use digital health tools for clinical data documentation and teledermatology. Our dataset included a total of 1,709 images from 506 patients. Two convolutional neural networks, ResNet-50 and VGG-16, were adopted to examine the performance of different deep learning architectures and to validate their feasibility for diagnosing the targeted skin NTDs.

Principal findings
The two models correctly predicted over 70% of the diagnoses, and performance improved consistently with more training samples. The ResNet-50 model performed better than the VGG-16 model. Models trained with PCR-confirmed cases of Buruli ulcer yielded a 1-3% increase in prediction accuracy over training sets that included unconfirmed cases.

Conclusions
Our approach was to have the deep learning model distinguish between multiple pathologies simultaneously, which is close to real-world practice. The more images used for training, the more accurate the diagnosis became. The percentage of correct diagnoses increased with PCR-positive cases of Buruli ulcer. This suggests that training on images from more accurately diagnosed cases may improve the accuracy of the resulting AI models. However, the increase was marginal, which may indicate that clinical diagnosis alone is reliable to an extent for Buruli ulcer. Diagnostic tests also have their flaws and are not always reliable. One hope for AI is that it will objectively resolve this gap between diagnostic tests and clinical diagnoses by adding another tool. While there are still challenges to overcome, AI has the potential to address unmet needs where access to medical care is limited, as it is for those affected by skin NTDs.

medRxiv preprint doi: https://doi.org/10.1101/2023.03.14.23287243; this version posted March 15, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Figure 2 provides eight example images that were incorrectly predicted by our pilot AI model based on ResNet-50, with k = 50% training data for all data (Task 1). Numbers in parentheses represent the likelihood the prediction model assigned to its diagnosis [prediction label] as compared with the actual diagnosis [true label], or the ground truth. An uncertainty score is also given to each test image, calculated as the correlation between the predicted probability and a random guess; a higher correlation means a higher uncertainty score. The uncertainty score indicates the degree of irrelevant evidence the AI model finds in the given test image when predicting its diagnosis. For example, Figure 2(a) shows a true label score for yaws of 0.187 and a predicted label score for Buruli ulcer of 0.254, with a high uncertainty of 0.93. This means that the model predicted the image to be more like Buruli ulcer than yaws, but it was also highly uncertain. An uncertainty score closer to 1 represents higher uncertainty in the diagnosis output. When the model is 100% uncertain, its estimate is equivalent to a random guess, and it provides a confidence score of 0.200 (5 diseases, 1/5 = 0.200). The AI prediction is better when the uncertainty score is lower, although the diagnosis could still be incorrect.
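The text does not give the exact formula behind the uncertainty score, so the following Python sketch is only an illustration under stated assumptions. It uses normalized Shannon entropy as a stand-in for "closeness to a random guess": the value is 0 for a fully confident prediction and 1 for a uniform prediction over the five diseases. The function name `normalized_entropy` is hypothetical, and the three probabilities not reported for Figure 2(a) are placeholder values chosen only so that the vector sums to 1. The sketch is not expected to reproduce the reported score of 0.93.

```python
import math

N_CLASSES = 5  # Buruli ulcer, leprosy, mycetoma, scabies, yaws
CHANCE_LEVEL = 1 / N_CLASSES  # confidence of a pure random guess: 1/5


def normalized_entropy(probs):
    """Assumed proxy for uncertainty, NOT the authors' formula.

    Returns Shannon entropy divided by log(number of classes), so the
    value is 0 for a one-hot prediction and 1 for a uniform one.
    """
    if len(probs) < 2:
        raise ValueError("need at least two classes")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-6):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability classes contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


# Figure 2(a): 0.254 (predicted, Buruli ulcer) and 0.187 (true, yaws) come
# from the text; the other three values are hypothetical fillers chosen
# only so that the vector sums to 1.
fig_2a_like = [0.254, 0.187, 0.187, 0.186, 0.186]
uniform = [CHANCE_LEVEL] * N_CLASSES

print(f"chance level: {CHANCE_LEVEL:.3f}")
print(f"near-uniform prediction: {normalized_entropy(fig_2a_like):.3f}")
print(f"uniform prediction: {normalized_entropy(uniform):.3f}")
```

Under this proxy, a prediction whose top probability is only slightly above 0.200 scores close to 1, which matches the qualitative reading of Figure 2(a): the model prefers Buruli ulcer over yaws but is barely better than chance.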

To further understand why we can achieve better performance on Buruli ulcer and scabies but worse performance on, for instance, mycetoma, we used [...]

Leprosy and mycetoma used smaller sample sizes and had poorer performance. For leprosy, we speculate that it was not only the sample size but also the complexity of the disease presentation that impacted performance (17). We had a range of images from tuberculoid to borderline to lepromatous leprosy, and some included deformities and wounds that developed due to peripheral neuropathy. We stratified these different conditions and ran the same analysis, [...]

[...] to our hypothesis, the percentage increase was minimal (3% for Buruli ulcer), which may indicate that clinical diagnosis alone is reliable to an extent. For Buruli ulcer in particular, a previous study by Eddyani et al. showed that the sensitivity of clinical diagnosis was as high as 92% (95% CI, 85-96%), the highest of all methods, including PCR (18). PCR results can be false negative in Buruli ulcer due to several factors, for example the site of sample collection, skill in sample taking, and the duration of the wound (19). While PCR is currently the preferred test for diagnostic confirmation, it has its flaws and is not always reliable. In many studies, PCR is considered 65-70% sensitive (20) or even only 61% sensitive (21). Specificity is perhaps highest for the PCR-positive cases, but sensitivity is highest for clinically identified cases. The PCR-positive cases should be enriched for true cases, but PCR also misses true cases. One hope for AI, which our findings also support, is that it will objectively resolve this gap [...]

[...] played some role in these being predicted as Buruli ulcer cases, as the most commonly affected body parts in Buruli ulcer are the limbs (23,24). Figure 2(c) was a case of yaws, but the main lesion was not centered, and the lesion of interest was not very obvious. The backgrounds or the clothes may have disturbed the predictions in cases such as (a), (e), (g), and (h). It will be necessary to understand these patterns in order to resolve incorrect predictions, which will be one of our future study directions.
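One simple starting point for studying such patterns is to tally which true diagnoses are most often mistaken for which predicted ones. The Python sketch below is a minimal illustration of that bookkeeping, not the analysis used in this study. The function names `confusion_counts` and `most_common_errors` are hypothetical, and the label lists are toy values invented for illustration, not study results.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def most_common_errors(counts, top=3):
    """Return the most frequent misclassified (true, predicted) pairs."""
    errors = Counter(
        {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    )
    return errors.most_common(top)


# Toy labels, invented for illustration only (NOT data from the study).
toy_true = ["yaws", "yaws", "leprosy", "mycetoma", "scabies", "Buruli ulcer"]
toy_pred = ["Buruli ulcer", "Buruli ulcer", "Buruli ulcer",
            "mycetoma", "scabies", "Buruli ulcer"]

counts = confusion_counts(toy_true, toy_pred)
for (true_label, predicted_label), n in most_common_errors(counts):
    print(f"{true_label} predicted as {predicted_label}: {n}")
```

Grouping the misclassified images by such pairs makes it easier to inspect them side by side for shared features, such as lesion location, off-center framing, or distracting backgrounds.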

A major source of bias in AI applications stems from the availability and

There are limitations to our study, some of which were already described, such as the limited number of images and the imbalance in image numbers between diseases. Moreover, images were taken under different conditions, and they were [...]

In particular, the hope is that it will address unmet needs where access to medical care is limited, as it is for those affected by skin NTDs.

Acknowledgements

We would like to pay special thanks to Prof. Bamba Vagamon and Prof. Almamy [...]