Nat Commun. 2016 Sep 16;7:12797. doi: 10.1038/ncomms12797.

Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes.

Author information

1. Pathogen Genomics, Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK.
2. Department of Mathematics and Statistics, University of Helsinki, Helsinki FI-00014, Finland.
3. Department of Medical and Clinical Genetics, Genome-Scale Biology Research Program, University of Helsinki, Helsinki FI-00014, Finland.
4. Department of Medicine, University of Cambridge, Cambridge CB2 0SP, UK.
5. Department of Infectious Disease Epidemiology, Imperial College, London W2 1NY, UK.
6. Department of Computer Science, Aalto University, Espoo FI-00076, Finland.
7. Helsinki Institute of Information Technology HIIT, Department of Computer Science, Aalto University, Espoo FI-00076, Finland.
8. Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, Victoria 3010, Australia.
9. Centre for International Child Health, Department of Paediatrics, University of Melbourne, Melbourne, Victoria 3052, Australia.
10. Group A Streptococcal Research Group, Murdoch Children's Research Institute, Parkville, Victoria 3052, Australia.
11. Menzies School of Health Research, Darwin, Northern Territory 0811, Australia.
12. Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki FI-00014, Finland.
13. Department of Biostatistics, University of Oslo, 0317 Oslo, Norway.

Abstract

Bacterial genomes vary extensively in terms of both gene content and gene sequence. This plasticity hampers the use of traditional SNP-based methods for identifying all genetic associations with phenotypic variation. Here we introduce a computationally scalable and widely applicable statistical method (SEER) for the identification of sequence elements that are significantly enriched in a phenotype of interest. SEER is applicable to tens of thousands of genomes by counting variable-length k-mers using a distributed string-mining algorithm. Robust options are provided for association analysis that also correct for the clonal population structure of bacteria. Using large collections of genomes of the major human pathogens Streptococcus pneumoniae and Streptococcus pyogenes, SEER identifies relevant previously characterized resistance determinants for several antibiotics and discovers potential novel factors related to the invasiveness of S. pyogenes. We thus demonstrate that our method can answer important biologically and medically relevant questions.
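
To make the idea of "sequence elements enriched in a phenotype" concrete, the following is a minimal illustrative sketch, not the SEER implementation. It assumes fixed-length k-mers (SEER counts variable-length k-mers with a distributed string-mining algorithm), a binary phenotype, and a plain chi-squared test on k-mer presence or absence with no correction for population structure, which the abstract states SEER does provide. The function names `kmers` and `kmer_enrichment` and the toy sequences are hypothetical.

```python
import math
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers of length k in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_enrichment(samples, k):
    """Test each k-mer for association with a binary phenotype.

    samples: list of (sequence, phenotype) pairs, phenotype is 0 or 1.
    Returns {kmer: (chi2, p_value)} from a 2x2 presence/absence table.
    Assumption: no population-structure correction (unlike SEER).
    """
    n_pos = sum(1 for _, y in samples if y == 1)
    n_neg = len(samples) - n_pos

    # kmer -> [samples with phenotype 0 carrying it, samples with phenotype 1]
    present = defaultdict(lambda: [0, 0])
    for seq, y in samples:
        for km in kmers(seq, k):
            present[km][y] += 1

    results = {}
    for km, (neg_with, pos_with) in present.items():
        a, b = pos_with, n_pos - pos_with   # phenotype 1: with / without
        c, d = neg_with, n_neg - neg_with   # phenotype 0: with / without
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            continue  # k-mer is in every sample, or a phenotype class is empty
        chi2 = n * (a * d - b * c) ** 2 / denom
        # Survival function of chi-squared with 1 degree of freedom
        p_value = math.erfc(math.sqrt(chi2 / 2))
        results[km] = (chi2, p_value)
    return results


if __name__ == "__main__":
    toy = [("AAAC", 1), ("AAAC", 1), ("AAAG", 0), ("AAAG", 0)]
    for km, (chi2, p) in sorted(kmer_enrichment(toy, 3).items()):
        print(km, round(chi2, 3), round(p, 4))
```

In a real bacterial collection, a naive test like this would be confounded by clonal lineage, which is why the method described above includes a correction for population structure.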

PMID: 27633831
PMCID: PMC5028413
DOI: 10.1038/ncomms12797
[Indexed for MEDLINE]
Free PMC Article
