Nat Biotechnol. 2014 Sep;32(9):888-95. doi: 10.1038/nbt.3000. Epub 2014 Aug 24.

Detecting and correcting systematic variation in large-scale RNA sequencing data.

Author information

1
[1] Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA. [2] The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York, USA. [3].
2
[1] Chair of Bioinformatics Research Group, Boku University Vienna, Vienna, Austria. [2].
3
Chair of Bioinformatics Research Group, Boku University Vienna, Vienna, Austria.
4
Department of Bioinformatics, WEHI, Melbourne, Australia.
5
State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, Schools of Life Sciences and Pharmacy, Fudan University, Shanghai, China.
6
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.
7
Center for Genomics and Division of Microbiology & Molecular Genetics, School of Medicine, Loma Linda University, Loma Linda, California, USA.
8
National Center for Biotechnology Information (NCBI), Bethesda, Maryland, USA.
9
[1] Chair of Bioinformatics Research Group, Boku University Vienna, Vienna, Austria. [2] University of Warwick, Coventry, UK.
10
[1] Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA. [2] The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York, USA. [3] The Feil Family Brain and Mind Research Institute, New York, New York, USA.

Abstract

High-throughput RNA sequencing (RNA-seq) enables comprehensive scans of entire transcriptomes, but best practices for analyzing RNA-seq data have not been fully defined, particularly for data collected with multiple sequencing platforms or at multiple sites. Here we used standardized RNA samples with built-in controls to examine sources of error in large-scale RNA-seq studies and their impact on the detection of differentially expressed genes (DEGs). Analysis of variations in guanine-cytosine content, gene coverage, sequencing error rate and insert size allowed identification of decreased reproducibility across sites. Moreover, commonly used methods for normalization (cqn, EDASeq, RUV2, sva, PEER) varied in their ability to remove these systematic biases, depending on sample complexity and initial data quality. Normalization methods that combine data from genes across sites are strongly recommended to identify and remove site-specific effects and can substantially improve RNA-seq studies.

PMID:
25150837
PMCID:
PMC4160374
DOI:
10.1038/nbt.3000
[Indexed for MEDLINE]
Free PMC Article
