Bacterial strain sharing between humans, animals, and the environment among urban households

Identifying bacterial transmission pathways is crucial to inform strategies aimed at curbing the spread of pathogenic and antibiotic-resistant bacteria, especially in rapidly urbanizing low- and middle-income countries. In this study, we assessed bacterial strain-sharing and dissemination of antibiotic resistance across humans, domesticated poultry, canines, household soil, and drinking water in urban informal settlements in Nairobi, Kenya. We collected 321 samples from 50 households and performed Pooling Isolated Colonies-seq (PIC-seq) by sequencing pools of up to five Escherichia coli colonies per sample to capture strain diversity, strain-sharing patterns, and overlap of antibiotic-resistant genes (ARGs). Bacterial strains isolated from the household environment carried clinically relevant ARGs, reinforcing the role of the environment in antibiotic resistance dissemination. Strain-sharing rates and resistome similarities across sample types were strongly correlated within households, suggesting clonal spread of bacteria is a main driver of dissemination of ARGs in the domestic urban environment. Within households, E. coli strain-sharing was rare between humans and animals but more frequent between humans and drinking water. E. coli contamination in stored drinking water was also associated with higher strain-sharing between humans in the same household. Our study demonstrates that contaminated drinking water facilitates human to human strain sharing and water treatment can disrupt transmission.


Preparation of simulated datasets for benchmarking
To benchmark our bioinformatics pipeline, we generated simulated datasets that emulate sequence reads we would expect from PIC-seq on five Escherichia coli isolates.The paired-end Illumina sequence reads (2x150 bp) and the complete genome sequences of five clinically relevant antibiotic resistant E. coli isolates, possessing either blaCTX-M-15 or mcr-1 genes along with other antibiotic resistant genes (ARGs), were provided by Broad Institute.We characterized each isolate based on their complete genome sequences.The phylogroup of each E. coli isolate was predicted using ClermonTyping v20.03 1 with default parameters.MOB-suite v3.1.0was used to identify any plasmid contents included in the genome files. 2 The gene coding sequences within genomes were predicted using Prodigal v2.6.3 3 and were annotated using DIAMOND blastp v2.0.14.152 4 against the Comprehensive Antibiotic Resistance Database (CARD) v3.2.4 with parameters set to minimum identity of 90%, minimum query cover 80%, and minimum amino acid length of 25.The proximity of mobile genetic elements (MGEs) to genes annotated as ARGs was examined in a 5000 bp region both upstream and downstream of the ARGs, using the mobileOG-pl pipeline and the mobileOGdb release beatrix-1.6as the reference database. 5 To generate simulated datasets, we subsampled the sequence reads from each E. coli isolate and pooled with 6 different proportions, ranging from equal ratios to 400-fold difference in sequence coverage, targeting a total sequencing depth of 2 Gb using seqtk v1.3 (Supplementary table 7).After pooling sequences, the quality of the reads was screened using Trimmomatic v0.39 6 with parameters set to LEADING: 3, TRAILING: 3, SLIDINGWINDOW: 4:15, and MINLEN: 70.The downstream analyses were performed subsequently based on these datasets.

Comparison of resistomes reconstructed from two different assembly strategies
To evaluate the efficacy of using a reference-based binning approach prior to de novo assembly in reconstructing resistomes, we compared resistome profiles generated from assemblies with and without the binning process.1) For assemblies that employed the binning approach, the procedure followed was identical to the approach used for real samples.The plasmid-like contigs were first assembled out of initial data, and the initial reads were aligned to these plasmid contigs and the reference genomes of strains predicted by StrainGE 7 using mSWEEP v2.0.0 8 and mGEMs v1.3.1 9 with default parameters.The reads binned to each reference genome were individually de novo assembled using metaSPAdes v3.15.4. 10 The unaligned reads were separately de novo assembled using metaSPAdes.2) The assembly without binning process was simply performed by running metaSPAdes on the initial simulated dataset with default parameters.For all contigs assembled from two different approaches, the coding gene sequences were predicted using Prodigal v2.6.3, 3 and the genes were annotated by running DIAMOND blastp v2.0.14.152 4 against the Comprehensive Antibiotic Resistance Database with the same parameter used for the real samples.The co-localization of MGEs was investigated by annotating coding genes located within 5000 bp both upstream and downstream of identified ARGs.
ARGs that had MGEs in their flanking regions were considered to have the potential for mobility.ARGs with no MGEs in their flanking regions, but whose flanking region were shorter than 5000 base pairs due to contig size limitations, were categorized as having ambiguous mobility.The coding regions of all ARGs were clustered at 100% nucleotide identity together with ARGs from reference genomes for the comparison of resistomes.

Assembly using short or long sequence reads of simulated datasets
To examine the efficacy of utilizing long read sequence data for investigating ARG sharing dynamics, we compared the assembly metrics of contigs assembled using short or long sequence reads.The long sequence reads of the matched five E. coli isolates were generated using Oxford Nanopore Technology MinION Platform and provided by Broad, along with the paired-end Illumina reads.We generated the simulated PIC-seq E. coli dataset for long reads using the methods described for the short reads.
In addition to the contigs assembled using only short reads, we employed two additional approaches; 1) assembling using short reads first, then correcting with long reads and 2) first assembling long reads and then polishing the resulting contigs with the short reads mapped onto them.Binning of sequence reads to the reference genomes identified by StrainGE was performed before assembly, as described for the short reads.To assemble contigs starting with short reads and then correct the resulting contigs with long reads, we used HybridSPAdes v3.15.4 11 with default parameters.For the assembly of contigs beginning with long reads, we utilized metaFlye v.2.8.1 12 with default parameters, followed by medaka v1.7.0 for error correction in the contigs, using the same long reads as input.Short reads were then mapped on to the contigs using bbmap, 13 and the resulting alignment files were utilized to polish errors in the contigs using pilon v1.24. 14The assembly metrics of the contigs were estimated by using MetaQUAST v5.2.0, 15 based on reference genomes provided as the complete genomes of E. coli isolates used for simulated datasets.

Strain identification of the simulated PIC-seq datasets
The characteristics of E. coli strains used for generating the simulated PIC-seq data were analysed based on their complete genome sequences.Genome sizes across the strains ranged from 4.7 to 5.3 Mbp.The isolate GTEN 247 had the highest number of plasmids while no plasmid sequences were detected in the GTEN 378 isolate (Supplementary Table 8).In terms of phylogroup classification, the E. coli strains consisted of phylogroups commonly observed in real samples, with the isolates GTEN 291 and GTEN 293 assigned to identical phylogroup A (Supplementary Table 8).The range of unique ARG clusters identified across the strains was comparable, varying from 45 to 57.The GTEN 378 isolate had the lowest number ARGs predicted to be mobile, possibly due to the absence of plasmid.When individual strain identification was conducted on the Illumina sequences for each isolate, distinct strains were identified for each one.
Subsequent strain identification was conducted on simulated PIC-seq datasets generated by pooling sequence reads from E. coli strains in varying proportions.In all datasets, only four strains were identifiable, and the reference strain initially associated with GTEN 291, which was E. coli NCTC9087, was not identified, possibly due to high similarities (> 99.0%ANI) between GTEN 291 and GTEN 293 (Supplementary table 7).Except for GTEN 291, the trends of the relative abundance of the identified strains and the pooled ratio were generally in good agreement.

Benchmarking of reconstructed resistomes
The purpose of performing PIC-seq on E. coli isolates was not only to have a higher number of strains investigated for strain-sharing events but also to examine the resistomes with their genomic contexts.Sequence data comprising up to five different isolates is expected to yield better contig reconstruction compared to sequence data from more complex microbial communities.However, the assembly process could be challenging in deconvoluting pooled strains as the strains were highly similar.We employed the assembly approach suggested in plate sweep metagenomics, which involves a preliminary binning process based on reference strains before de novo assembly.In addition to assembly of binned reads, an additional de novo assembly was performed on the reads that did not map to any of the reference genomes, and ARGs and MGEs were annotated on the entire contigs.We performed benchmarking tests on this bioinformatics pipeline to evaluate its performance in reconstructing the resistome of samples, using simulated datasets.We also compared the resistomes reconstructed from contigs that were de novo assembled without a binning process to evaluate whether the binning process enhanced the accuracy of resistome reconstruction.
The assembly incorporating a binning process consistently outperformed those that did not in reconstructing resistomes across all simulated datasets.With a clustering at 100% nucleotide identity, the simulated dataset was expected to contain 212 overall ARG clusters, comprising 118 non-mobile and 94 mobile ARG clusters (Supplementary Fig. 5).In resistomes with a binning process, the number of overall true-positive ARG clusters, which aligned with those from reference genomes, ranged from 173 to 189 (accounting for 81.6% to 89.2%).In contrast, resistomes without a binning process only had 32 to 59 overall true-positive ARG clusters (accounting for 15.1% to 27.8%) (Supplementary Fig. 5).The number of false-positive overall ARG clusters, defined as those deviating from ARG sequences in the reference genomes, was comparable between resistomes reconstructed with and without a binning approach in the simulated datasets of ratio 1, 2, and 3.However, in the simulated datasets with ratios 4, 5, and 6, which had at least a 100-fold difference in the pooling ratios between strains, the number of false-positive ARG clusters was significantly lower in resistomes reconstructed without the binning process (fewer than 10 instances).
The same trends were observed in both non-mobile and mobile resistomes.The non-mobile resistome had a higher number of true-positive ARG clusters, while the mobile resistome had a lower number of false-positive ARG clusters.
To determine which approach more accurately captures the true resistome, we calculated the Jaccard similarity between the resistomes derived from the simulated datasets and the reference genomes.In all the simulated datasets, resistomes reconstructed using the binning process exhibited at least a 1.9-fold higher Jaccard similarity to the reference compared to those without binning.This suggests that the incorporation of a binning process in the assembly pipeline significantly enhances the accuracy of resistome reconstruction in this study.(Supplementary Fig. 10).