Format

Send to

Choose Destination
Genome Res. 2015 May;25(5):736-49. doi: 10.1101/gr.185892.114. Epub 2015 Mar 30.

Accurate typing of short tandem repeats from genome-wide sequencing data and its applications.

Author information

1
Integrative Biosciences, Bioinformatics and Genomics Option, Pennsylvania State University, University Park, Pennsylvania 16802, USA; Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802, USA; Center for Medical Genomics, Pennsylvania State University, University Park, Pennsylvania 16802, USA; The Genome Science Institute at the Huck Institutes of Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA;
2
Integrative Biosciences, Bioinformatics and Genomics Option, Pennsylvania State University, University Park, Pennsylvania 16802, USA; Center for Medical Genomics, Pennsylvania State University, University Park, Pennsylvania 16802, USA; The Genome Science Institute at the Huck Institutes of Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA; Department of Biochemistry and Molecular Biology, Pennsylvania State University, Pennsylvania 16802, USA;
3
Center for Medical Genomics, Pennsylvania State University, University Park, Pennsylvania 16802, USA; Department of Pathology, The Jake Gittlen Laboratories for Cancer Research, Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA;
4
Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802, USA; Center for Medical Genomics, Pennsylvania State University, University Park, Pennsylvania 16802, USA;
5
Department of Computer Science and Engineering, Pennsylvania State University, University Park, Pennsylvania 16802, USA.
6
Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802, USA;
7
Center for Medical Genomics, Pennsylvania State University, University Park, Pennsylvania 16802, USA; The Genome Science Institute at the Huck Institutes of Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA; Department of Biochemistry and Molecular Biology, Pennsylvania State University, Pennsylvania 16802, USA; Department of Computer Science and Engineering, Pennsylvania State University, University Park, Pennsylvania 16802, USA.
8
Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802, USA; Center for Medical Genomics, Pennsylvania State University, University Park, Pennsylvania 16802, USA; The Genome Science Institute at the Huck Institutes of Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA;

Abstract

Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem repeat profiling using flank-based mapping, a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution.

PMID:
25823460
PMCID:
PMC4417121
DOI:
10.1101/gr.185892.114
[Indexed for MEDLINE]
Free PMC Article

Supplemental Content

Full text links

Icon for HighWire Icon for PubMed Central
Loading ...
Support Center