Bioinformatics. 2018 Sep 1;34(17):2889-2898. doi: 10.1093/bioinformatics/bty211.

Inference of the human polyadenylation code.

Leung MKK (1,2), Delong A (1,2), Frey BJ (1,2,3).

1. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada.
2. Deep Genomics, MaRS Centre, Toronto, Canada.
3. Banting and Best Department of Medical Research, University of Toronto, Toronto, Canada.



Processing of transcripts at the 3'-end involves cleavage at a polyadenylation site followed by the addition of a poly(A)-tail. By selecting which site is cleaved, the process of alternative polyadenylation enables genes to produce transcript isoforms with different 3'-ends. To facilitate the identification and treatment of disease-causing mutations that affect polyadenylation and to understand the sequence determinants underlying this regulatory process, a computational model that can accurately predict polyadenylation patterns from genomic features is desirable.


Previous work has focused on identifying candidate polyadenylation sites and classifying tissue-specific sites. By training on 3'-end sequencing data that show how multiple sites within a gene are competitively selected for polyadenylation, we developed a deep learning model that predicts the tissue-specific strength of a polyadenylation site in 3' untranslated regions of the human genome given only its genomic sequence. We demonstrate the model's broad utility on multiple tasks without any application-specific training: predicting which polyadenylation site is more likely to be selected in genes with multiple sites, scanning the 3' untranslated region for candidate polyadenylation sites, classifying the pathogenicity of variants near annotated polyadenylation sites in ClinVar, and anticipating the effect of antisense oligonucleotide experiments designed to redirect polyadenylation. We also analyze how different features affect the model's predictive performance and provide a method to identify, at single-base resolution, sensitive regions of the genome that can affect polyadenylation regulation.
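As a rough illustration of two ideas in this abstract, competitive site selection and single-base sensitivity, the following minimal sketch may help. It is not the authors' model: `toy_site_strength` is a hypothetical hand-written scorer standing in for the learned network, the softmax over per-site strengths is an assumed way to express competition between sites, and the example sequences are made up.

```python
import math

def toy_site_strength(seq):
    """Hypothetical stand-in for a learned strength model (NOT the paper's network).
    Scores a DNA sequence by counting the canonical AATAAA signal and the ATTAAA variant."""
    seq = seq.upper()
    return 2.0 * seq.count("AATAAA") + 1.0 * seq.count("ATTAAA")

def relative_usage(strengths):
    """Assumed competition rule: a softmax over per-site strengths,
    so the sites of one gene share a total probability of 1."""
    m = max(strengths)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in strengths]
    total = sum(exps)
    return [e / total for e in exps]

def mutation_sensitivity(seq, score=toy_site_strength):
    """In silico mutagenesis: for each position, the largest absolute change
    in score caused by any single-base substitution at that position."""
    base = score(seq)
    sensitivities = []
    for i, ref in enumerate(seq):
        deltas = [
            abs(score(seq[:i] + alt + seq[i + 1:]) - base)
            for alt in "ACGT" if alt != ref
        ]
        sensitivities.append(max(deltas))
    return sensitivities

# Made-up example sequences for two competing sites in one gene.
sites = {
    "proximal": "GCCATTAAAGCTGCACTGTT",
    "distal": "TTGAATAAAGTCTGCAATAAAGG",
}
strengths = [toy_site_strength(s) for s in sites.values()]
usage = dict(zip(sites, relative_usage(strengths)))
sens = mutation_sensitivity(sites["proximal"])

print(usage)
print(sens)
```

In this toy setting the positions inside the ATTAAA hexamer of the proximal site are the ones whose substitution changes the score, which is the kind of single-base sensitivity map the abstract describes. A real analysis would replace the toy scorer with the trained model.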

Supplementary information:

Supplementary data are available at Bioinformatics online.
