Bioinformatics. 2018 Apr 1;34(7):1099-1107. doi: 10.1093/bioinformatics/btx717.

RepLong: de novo repeat identification using long read sequencing data.

Author information

1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.
2. School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK.
3. Centre for Computational Biology, School of Biosciences, University of Birmingham, Birmingham B17 2TT, UK.
4. College of Information Science.
5. School of Medicine, Shenzhen University, Shenzhen 518060, China.

Abstract

Motivation:

The identification of repetitive elements is important in genome assembly and phylogenetic analyses. Existing de novo repeat identification methods that rely on short reads are ineffective at identifying long repeats. Because long reads are more likely to cover repeat regions completely, they are better suited to recognizing long repeats.

Results:

In this study, we propose RepLong, a novel de novo repeat element identification method based on PacBio long reads. Because reads mapped to repeat regions overlap heavily with one another, identifying repeat elements is equivalent to discovering consensus overlaps between reads, which can in turn be cast as a community detection problem in the network of read overlaps. RepLong first constructs a network of read overlaps from pairwise alignments of the reads, where each vertex represents a read and each edge represents a substantial overlap between the two corresponding reads. Second, communities whose intra-community connectivity exceeds their inter-community connectivity are extracted by network modularity optimization. Finally, representative reads from each community are extracted to form the repeat library. Comparisons with genome-based and short-read-based methods on Drosophila melanogaster and human long-read sequencing data demonstrate the efficiency of RepLong in identifying long repeats. RepLong can handle lower-coverage data and can serve as a complementary solution to existing methods, improving repeat identification performance on long-read sequencing data.
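
The three stages described above (overlap network, modularity-based community extraction, representative reads) can be illustrated with a minimal Python sketch. This is not the RepLong implementation. The function names, the `min_overlap` and `min_size` thresholds, the naive greedy merging strategy, and the choice of the best-connected read as the representative are all assumptions made for illustration. The overlaps are given as a toy list rather than computed by a real aligner.

```python
from itertools import combinations


def build_overlap_graph(overlaps, min_overlap=1000):
    """Build an undirected read-overlap graph.

    overlaps: iterable of (read_a, read_b, overlap_length) tuples.
    min_overlap is a hypothetical threshold for a "substantial" overlap.
    """
    graph = {}
    for read_a, read_b, length in overlaps:
        if read_a != read_b and length >= min_overlap:
            graph.setdefault(read_a, set()).add(read_b)
            graph.setdefault(read_b, set()).add(read_a)
    return graph


def modularity(graph, communities):
    """Newman modularity Q of a partition of an unweighted graph."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    q = 0.0
    for comm in communities:
        internal = sum(1 for u in comm for v in graph[u] if v in comm) / 2
        degree = sum(len(graph[u]) for u in comm)
        q += internal / m - (degree / (2 * m)) ** 2
    return q


def greedy_communities(graph):
    """Naive greedy modularity optimization (an assumed stand-in).

    Start with one community per read and repeatedly apply the merge of
    two connected communities that raises Q the most, until no merge helps.
    """
    communities = [frozenset([read]) for read in sorted(graph)]
    best_q = modularity(graph, communities)
    while len(communities) > 1:
        best_merge = None
        for a, b in combinations(communities, 2):
            # Skip pairs with no overlap edge between them.
            if not any(v in b for u in a for v in graph[u]):
                continue
            candidate = [c for c in communities if c != a and c != b] + [a | b]
            q = modularity(graph, candidate)
            if q > best_q + 1e-12:
                best_q, best_merge = q, candidate
        if best_merge is None:
            break
        communities = best_merge
    return communities


def representative_reads(graph, communities, min_size=3):
    """Pick one read per community: the one with most intra-community overlaps.

    Both min_size and this selection rule are assumptions for illustration.
    Ties are broken by read name.
    """
    reps = []
    for comm in communities:
        if len(comm) < min_size:
            continue
        reps.append(max(sorted(comm),
                        key=lambda r: sum(1 for v in graph[r] if v in comm)))
    return sorted(reps)


# Toy data: two groups of mutually overlapping reads, one bridging overlap,
# and one short overlap that falls below the threshold.
overlaps = [(a, b, 2000) for a, b in combinations(["r1", "r2", "r3", "r4"], 2)]
overlaps += [(a, b, 2000) for a, b in combinations(["r5", "r6", "r7", "r8"], 2)]
overlaps += [("r4", "r5", 2000), ("r1", "r8", 500)]

graph = build_overlap_graph(overlaps)
communities = greedy_communities(graph)
library = representative_reads(graph, communities)
print(sorted(sorted(c) for c in communities))
print(library)
```

The greedy loop recomputes modularity for every candidate merge, which is simple to read but only practical for small graphs. A real read-overlap network would need an incremental or multilevel modularity method.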

Availability and implementation:

RepLong is freely available at https://github.com/ruiguo-bio/replong.

Contact:

ywsun@szu.edu.cn or zhuzx@szu.edu.cn.

Supplementary information:

Supplementary data are available at Bioinformatics online.
