Peakhood: individual site context extraction for CLIP-seq peak regions

Abstract Motivation CLIP-seq is by far the most widely used method to determine transcriptome-wide binding sites of RNA-binding proteins (RBPs). The binding site locations are identified from CLIP-seq read data by tools termed peak callers. Many RBPs bind to a spliced RNA (i.e. transcript) context, but all currently available peak callers only consider and report the genomic context. To accurately model protein binding behavior, a tool is needed for the individual context assignment to CLIP-seq peak regions. Results Here we present Peakhood, the first tool that utilizes CLIP-seq peak regions identified by peak callers, in tandem with CLIP-seq read information and genomic annotations, to determine which context applies, individually for each peak region. For sites assigned to transcript context, it further determines the most likely splice variant, and merges results for any number of datasets to obtain a comprehensive collection of transcript context binding sites. Availability and implementation Peakhood is freely available under MIT license at: https://github.com/BackofenLab/Peakhood. Supplementary information Supplementary data are available at Bioinformatics online.


How Peakhood works
Here we briefly describe how Peakhood works. For full details, please check out Peakhood's comprehensive online manual at: https://github.com/BackofenLab/Peakhood

Site context extraction
To extract individual site context information for a CLIP-seq dataset, Peakhood's input consists of the genomic CLIP-seq peak regions (BED), the mapped CLIP-seq reads (BAM), a genomic annotations file (GTF), and a genome sequence file (.2bit). Peakhood first intersects the peak regions with transcript and exon regions from the GTF file, to obtain exonic, intronic, and intergenic sites. Next it determines for each exonic site whether it is more likely embedded in a genomic context (introns included) or transcript context (mature or spliced RNA). For this Peakhood makes use of the exon-intron read coverage ratios in the site neighborhood, as well as over the whole transcript. This is based on the observation that an exonic site inside a transcript (spliced) context (Paper Fig. 1a) usually features considerably more reads in the exon region(s), as well as a pronounced coverage drop-off at the exon borders. Ideally this is true both locally (around the overlapping exon) and globally (on the whole transcript). However, due to how the CLIP-seq protocol works, read coverage is often concentrated at and around the binding site, so Peakhood weighs the local context information higher than the global one. In addition, intron-spanning reads receive more weight than continuously mapped reads, since they provide strong support for a transcript context. Sites which feature sufficiently high local and global ratios (see online manual for more details on filter steps and thresholds) get assigned to transcript context. Exonic sites with lower ratios get assigned to genomic context (Fig. S1).

Choosing the most likely transcript
Since a gene usually has many transcript isoforms, and there can be several overlapping exons and transcripts which pass the filters, a transcript context site can have several possible site-transcript combinations. Peakhood thus also selects the most likely combination, based on number of informative filters: co-occurrence of other sites, read coverage, intron-spanning read numbers, and transcript support level. Filter order, choice of filters, and filter behavior (serial or majority vote) can be further customized. In addition, sites at exon borders connected by intron-spanning reads get merged into single sites (see Paper Fig. 1a example). Reference and custom transcript annotations are supported, which can be advantageous if created for the same cell types or conditions (see online manual on how to create custom annotations). Incorporation of RNA-seq data is also possible, to provide additional intron-spanning read information for transcript selection.

Merging transcript context sets
In addition to extracting transcript context sites for single CLIP-seq datasets, Peakhood can merge any number of transcript context sets into comprehensive transcript context site collections (see Paper Fig. 1b for general workflow). Output table files contain information on transcripts and overlapping sites, both for all and the most likely site-transcript combinations. Site pairs on transcripts and their genomic and transcript distances are also reported. This way, one e.g. can quickly filter for and spot interesting site pairs (same or different RBP), where the transcript site distance is lower than the original genomic distance.

Agreement with known RBP roles
When Peakhood performs the site context extraction, it reports (among other statistics) three informative percentages: the percentage of exonic sites (divided by all sites), the percentage of transcript context sites (divided by all exonic sites), and the percentage of exon border sites (divided by all transcript context sites). Fig. S2 shows these percentages for four eCLIP datasets from four different RBPs, obtained by running site context extraction (peakhood extract) with default parameters. We can see that for typical spliced RNA binding RBPs (IGF2BP1, PUM1, PUM2), most sites overlap with exons (>= 95%), and out of these >= 95% are assigned to transcript context. In contrast, for the splicing factor U2AF2 we get around 20% of exonic sites, and out of these only 5.9% get assigned to transcript context. This shows that Peakhood's transcript context selection agrees with known RBP roles. We also see that the number of exon border sites can be quite substantial, as in the case of PUM1 (around 25%). Such sites at exon borders connected by intron-spanning reads need to be merged, and not taken as separate binding events (see Paper Fig. 1). This again showcases the importance of a proper site context selection as done by Peakhood.

Runtime measurement
For the runtime measurement (inside conda environment with Peakhood installed), we downloaded and pre-processed the eCLIP PUM1 data (K562 cell line) as described in Peakhood's online manual on GitHub [3].