HoCoRT: host contamination removal tool

BMC Bioinformatics. 2023 Oct 2;24(1):371. doi: 10.1186/s12859-023-05492-w.

Abstract

Background: Shotgun metagenome sequencing data obtained from a host environment will usually be contaminated with sequences from the host organism. Host sequences should be removed before further analysis to avoid biases, reduce downstream computational load, or ensure privacy in the case of a human host. The tools we identified as specifically designed for host contamination removal were either outdated, unmaintained, or complicated to use. Consequently, we have developed HoCoRT, a fast and user-friendly tool that implements several methods for optimised host sequence removal. We have evaluated the speed and accuracy of these methods.

Results: HoCoRT is an open-source command-line tool for host contamination removal. It is designed to be easy to install and use, offering a one-step option for genome indexing. HoCoRT employs a variety of well-known mapping, classification, and alignment methods to classify reads. The user can select the underlying classification method and its parameters, allowing adaptation to different scenarios. Based on our investigation of various methods and parameters using synthetic human gut and oral microbiomes, and on assessment of publicly available data, we provide recommendations for typical datasets with short and long reads.
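As an illustration of the final filtering step that any such tool has to perform, the following minimal Python sketch drops reads whose identifiers have been flagged as host by some upstream classifier. It is not HoCoRT's implementation: the names `FastqRecord`, `parse_fastq` and `remove_host_reads` are hypothetical, and the sketch assumes simple four-line FASTQ records and that the classifier reports host reads by read identifier.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class FastqRecord(NamedTuple):
    """One read from a four-line FASTQ record (hypothetical helper type)."""
    read_id: str
    header: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Parse FASTQ text, assuming exactly four lines per record."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # The read identifier is the header text up to the first whitespace.
        read_id = header[1:].split()[0]
        yield FastqRecord(read_id, header, seq, qual)


def remove_host_reads(
    records: Iterable[FastqRecord], host_ids: Set[str]
) -> Iterator[FastqRecord]:
    """Yield only the reads that were NOT flagged as host."""
    for record in records:
        if record.read_id not in host_ids:
            yield record


if __name__ == "__main__":
    sample = "@read1 extra\nACGT\n+\nIIII\n@read2\nTTTT\n+\nIIII\n"
    # Assumption: "read2" was classified as host by an upstream tool.
    for rec in remove_host_reads(parse_fastq(sample), {"read2"}):
        print(rec.read_id, rec.sequence)
```

In this sketch the choice of mapper or classifier only changes how `host_ids` is produced; the filtering step itself stays the same.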

Conclusions: To decontaminate a human gut microbiome with short reads using HoCoRT, we found the optimal combination of speed and accuracy with BioBloom, Bowtie2 in end-to-end mode, and HISAT2. Kraken2 consistently demonstrated the highest speed, albeit with a trade-off in accuracy. The same applies to an oral microbiome, but here Bowtie2 was notably slower than the other tools. For long reads, the detection of human host reads is more difficult. In this case, a combination of Kraken2 and Minimap2 achieved the highest accuracy and detected 59% of human reads. In comparison to the dedicated DeconSeq tool, HoCoRT using Bowtie2 in end-to-end mode proved considerably faster and slightly more accurate. HoCoRT is available as a Bioconda package, and the source code can be accessed at https://github.com/ignasrum/hocort along with the documentation. It is released under the MIT licence and is compatible with Linux and macOS (except for the BioBloom module).
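Figures such as the fraction of human reads detected can be computed on a synthetic dataset, where the true origin of every read is known. The sketch below shows one common way to do this in Python. The function name `evaluate_decontamination` and the exact metric definitions (sensitivity, precision, accuracy) are assumptions made for illustration; the abstract does not state which formulas the authors used.

```python
from typing import Dict, Iterable


def evaluate_decontamination(
    all_read_ids: Iterable[str],
    true_host_ids: Iterable[str],
    flagged_ids: Iterable[str],
) -> Dict[str, float]:
    """Compare reads flagged as host against the known truth.

    Hypothetical helper: metric definitions are the standard ones,
    not necessarily those used in the paper.
    """
    all_reads = set(all_read_ids)
    true_host = set(true_host_ids) & all_reads
    flagged = set(flagged_ids) & all_reads

    tp = len(flagged & true_host)       # host reads correctly flagged
    fp = len(flagged - true_host)       # microbial reads wrongly flagged
    fn = len(true_host - flagged)       # host reads that were missed
    tn = len(all_reads) - tp - fp - fn  # microbial reads correctly kept

    return {
        # Fraction of host reads that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of flagged reads that really are host.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Fraction of all reads classified correctly.
        "accuracy": (tp + tn) / len(all_reads) if all_reads else 0.0,
    }


if __name__ == "__main__":
    reads = [f"r{i}" for i in range(1, 11)]  # 10 synthetic reads
    host = ["r1", "r2", "r3", "r4"]          # 4 of them are host
    flagged = ["r1", "r2", "r3", "r9"]       # classifier output
    print(evaluate_decontamination(reads, host, flagged))
```

Under these definitions, a tool that detects 59% of human reads has a sensitivity of 0.59, regardless of how many microbial reads it wrongly removes; that second kind of error is captured by precision.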

Keywords: Classification; Contamination; Microbiome; Shotgun metagenome; Software.

MeSH terms

  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • Metagenome
  • Microbiota* / genetics
  • Sequence Analysis, DNA / methods
  • Software*