Comparative analysis of copy number variation detection methods and database construction

Asako Koike; Nao Nishida; Daiki Yamashita; Katsushi Tokunaga

doi:10.1186/1471-2156-12-29

BMC Genet. 2011 Mar 7;12:29. doi: 10.1186/1471-2156-12-29.

Authors

Asako Koike¹, Nao Nishida, Daiki Yamashita, Katsushi Tokunaga

Affiliation

¹ Central Research Laboratory, Hitachi Ltd., Tokyo, Japan. asako.koike.ea@hitachi.com

Abstract

Background: Array-based detection of copy number variations (CNVs) is widely used for identifying disease-specific genetic variations. However, the accuracy of CNV detection is not sufficient, and results differ depending on the detection program used and its parameters. In this study, we evaluated the performance of five widely used CNV detection programs, Birdsuite (mainly consisting of the Birdseye and Canary modules), Birdseye (part of Birdsuite), PennCNV, CGHseg, and DNAcopy, on the Affymetrix platform using HapMap data and other experimental data. Furthermore, we identified CNVs in 180 healthy Japanese individuals using the parameters that showed the best performance on the HapMap data and investigated their characteristics.

Results: The results indicate that the hidden Markov model-based programs PennCNV and Birdseye (part of Birdsuite), or Birdsuite, show better detection performance than the other programs when high reproducibility rates for the same individuals and low Mendelian inconsistencies are considered. Furthermore, when rates of overlap with other experimental results were taken into account, Birdsuite showed the best performance from the viewpoint of sensitivity but was expected to include many false negatives and some false positives. The results for the 180 healthy Japanese individuals demonstrate that the proportion of CNVs containing repeat sequences in both their start and end regions, not only segmental repeats but also long interspersed nuclear element (LINE) sequences, is higher in CNVs commonly detected among multiple individuals than in randomly selected regions, and that the conservation score based on primates is lower in these regions than in randomly selected regions. Similar tendencies were observed in the HapMap data and other experimental data.
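
The two evaluation criteria named above, reproducibility across replicates of the same individual and Mendelian inconsistency within trios, can be illustrated with a minimal sketch. The abstract does not state how the authors matched CNV calls, so everything here is an assumption made for illustration: the `(chromosome, start, end)` representation, the 50% overlap threshold, and all function names are hypothetical and are not taken from the paper or from any of the evaluated programs.

```python
from typing import List, Tuple

# Hypothetical CNV call: (chromosome, start, end), with end > start.
CNV = Tuple[str, int, int]


def overlap_fraction(a: CNV, b: CNV) -> float:
    """Fraction of call a's length that is covered by call b."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return max(shared, 0) / (a[2] - a[1])


def is_supported(call: CNV, others: List[CNV], min_frac: float = 0.5) -> bool:
    """True if any call in others covers at least min_frac of call.

    The 0.5 default is an assumed threshold, not the paper's criterion.
    """
    return any(overlap_fraction(call, o) >= min_frac for o in others)


def reproducibility_rate(rep1: List[CNV], rep2: List[CNV],
                         min_frac: float = 0.5) -> float:
    """Share of calls in two replicates that the other replicate supports."""
    total = len(rep1) + len(rep2)
    if total == 0:
        return 0.0
    hits = sum(is_supported(c, rep2, min_frac) for c in rep1)
    hits += sum(is_supported(c, rep1, min_frac) for c in rep2)
    return hits / total


def mendelian_inconsistency_rate(child: List[CNV], father: List[CNV],
                                 mother: List[CNV],
                                 min_frac: float = 0.5) -> float:
    """Share of the child's calls that neither parent's calls support."""
    if not child:
        return 0.0
    parents = father + mother
    unsupported = sum(not is_supported(c, parents, min_frac) for c in child)
    return unsupported / len(child)
```

Under these assumptions, a child's call that is absent from both parents may be a genuine de novo event, but a high rate across many trios more plausibly reflects false positives in the child or false negatives in the parents, which is why the rate is usable as an accuracy proxy.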

Conclusions: Our results suggest that not only segmental repeats but also interspersed repeats, especially LINE sequences, are deeply involved in CNVs, particularly in the formation of common CNVs. The detected CNVs are stored in the CNV repository database newly constructed by the "Japanese integrated database project" for sharing data among researchers (http://gwas.lifesciencedb.jp/cgi-bin/cnvdb/cnv_top.cgi).

Publication types

Comparative Study
Evaluation Study
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Asian People / genetics
DNA Copy Number Variations*
Databases, Genetic*
Humans
Markov Chains
Models, Genetic*
Oligonucleotide Array Sequence Analysis