Unique features of transcription termination and initiation at closely spaced tandem human genes

Abstract The synthesis of RNA polymerase II (Pol2) products, which include messenger RNAs or long noncoding RNAs, culminates in transcription termination. How the transcriptional termination of a gene impacts the activity of promoters found immediately downstream of it, and which can be subject to potential transcriptional interference, remains largely unknown. We examined in an unbiased manner the features of the intergenic regions between pairs of ‘tandem genes’—closely spaced (< 2 kb) human genes found on the same strand. Intergenic regions separating tandem genes are enriched with guanines and are characterized by binding of several proteins, including AGO1 and AGO2 of the RNA interference pathway. Additionally, we found that Pol2 is particularly enriched in this region, and it is lost upon perturbations affecting splicing or transcriptional elongation. Perturbations of genes involved in Pol2 pausing and R loop biology preferentially affect expression of downstream genes in tandem gene pairs. Overall, we find that features associated with Pol2 pausing and accumulation rather than those associated with avoidance of transcriptional interference are the predominant driving force shaping short tandem intergenic regions.


Table of contents
Appendix Figure S1. Enriched motifs in STIRs.
Appendix Figure S2. Pol2 marks at STIRs sense and antisense strands.
Appendix Figure S3. Evidence for elongating Pol2 at STIRs.
Appendix Figure S4. KD candidates for the transcription regulation of downstream tandem genes.
Appendix Figure S5. Pol2 marks at increasing lengths of tandem intergenic regions.
Appendix Figure S6. Cluster analysis of STIR normalized to promoter control.
Appendix Figure S7. Cluster analysis of STIR normalized to 3' control.
(A) Top: barplots showing the proportion of STIRs (purple) or control sequences (beige and orange) that carry the motifs discovered by STREME in DNA mode, for co-expressed tandem genes defined in either HepG2 or K562 cells and their respective mean aggregated controls. Shown are Bonferroni-corrected proportion test P-values (*:P<=0.05, **:P<=0.01, ***:P<=0.001, ****:P<=0.0001). Bottom: plots showing the overall average number and standard error of motif occurrences at STIRs or control sequences. P-values were obtained using paired Wilcoxon rank-sum test and corrected using Bonferroni correction.
(C) Boxplots showing the fraction of dinucleotide-preserving shuffled sequences of STIRs or controls that contain either of the G-rich motifs (GGGGCGGGGSC motif found based on 164 co-expressed tandem pairs in K562 cell line and GGGGCGGG found in HepG2 based on 188 tandem pairs, each of them compared to their respective controls), based on 1,000 shuffling iterations. Bars correspond to the prevalence of either motif in the actual sequences. The thickened line in the boxplots represents the median percent of shuffled sequences that carry the respective sequence, the lower and upper boxplot hinges correspond to first and third quartiles of the data, respectively. The whiskers represent the minimal/maximal existing value within 1.5 × inter-quartile range. Outliers were removed from the analysis. Shown also is the calculated Z-test P-value between occurrence prevalence at each group and its shuffled sequences (*:P<=0.05, ****:P<=0.0001).
(D) As in the bottom plot of (A), but for the G4NG4 motifs (or their reverse complement) in 188 tandem pairs defined in HepG2 cells or their mean aggregated controls. P-values were obtained using paired Wilcoxon rank-sum test and corrected using Bonferroni correction (****:P<=0.0001).
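The motif counting and significance annotation described in panels (A), (C) and (D) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the two-sided normal approximation for the Z-test, and the "ns" label are assumptions made for illustration. The dinucleotide-preserving shuffling itself is not implemented here, and the per-iteration fractions from shuffled sequences are assumed to be supplied by the caller.

```python
import math
import re
from statistics import mean, stdev

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "[CG]", "W": "[AT]", "R": "[AG]", "Y": "[CT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTSWRYKMN", "TGCASWYRMKN")


def reverse_complement(motif):
    """Reverse complement of a sequence or IUPAC motif."""
    return motif.translate(COMPLEMENT)[::-1]


def count_motif(seq, motif):
    """Count overlapping matches of an IUPAC motif or its reverse complement."""
    total = 0
    # A set avoids double counting when the motif is its own reverse complement.
    for m in {motif, reverse_complement(motif)}:
        pattern = "(?=" + "".join(IUPAC[base] for base in m) + ")"
        total += len(re.findall(pattern, seq.upper()))
    return total


def fraction_with_motif(seqs, motif):
    """Fraction of sequences carrying at least one motif occurrence."""
    return sum(count_motif(s, motif) > 0 for s in seqs) / len(seqs)


def shuffle_z_test(observed, shuffled_fractions):
    """Z-score of the observed fraction against shuffled fractions.

    Assumption: a two-sided P-value from the standard normal distribution.
    """
    z = (observed - mean(shuffled_fractions)) / stdev(shuffled_fractions)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def bonferroni(pvals):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    return [min(1.0, p * len(pvals)) for p in pvals]


def stars(p):
    """Map a P-value to the star notation used in the legends."""
    for cutoff, label in [(1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (0.05, "*")]:
        if p <= cutoff:
            return label
    return "ns"  # hypothetical label for non-significant results
```

For example, `count_motif(seq, "GGGGNGGGG")` counts G4NG4 occurrences on both strands, and `stars(p)` can be applied to each value returned by `bonferroni`.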
Appendix Figure S2. Pol2 marks at STIRs sense and antisense strands.

Appendix Figure S4. KD candidates for the transcription regulation of downstream tandem genes.

(A) Boxplot of expression changes in the upstream or downstream genes of 164 co-expressed tandem pairs following NELF-E KD in K562 cells (data from ENCODE), or in their controls (5 mean aggregated controls per tandem gene). Blue dots correspond to tandem pairs co-expressed in both K562 and HepG2 cell lines (or their respective controls). Black dots are tandem genes co-expressed only in the respective cell line. The thickened line represents the median log2 fold change following KD, and the lower and upper boxplot hinges correspond to the first and third quartiles of the data, respectively. The whiskers represent the minimal/maximal existing value within 1.5 × inter-quartile range. Outliers were removed from the analysis. P-values were obtained using paired Wilcoxon rank-sum tests.
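The boxplot elements described in this legend (median, hinges, whiskers within 1.5 × inter-quartile range, outliers) can be computed as in the following minimal sketch. The quartile interpolation method and the pseudocount in the fold-change helper are assumptions, since the legend does not specify them, and the function names are hypothetical.

```python
import math
from statistics import quantiles


def log2_fold_change(kd, control, pseudocount=1.0):
    """log2 ratio of KD to control expression (pseudocount is an assumption)."""
    return math.log2((kd + pseudocount) / (control + pseudocount))


def boxplot_stats(values):
    """Median, hinges, whiskers and outliers using the 1.5 x IQR rule."""
    # Assumption: 'inclusive' quartiles; plotting libraries may interpolate differently.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the most extreme data points inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }
```

Applied to the log2 fold changes of the upstream and downstream genes separately, this yields the values drawn in each box.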

Maximal Tandem Value Divided By Maximal Value of 3' Control
Clustering of the experiments was done using Pearson's correlation. Clustering of STIRs was done by computing the Euclidean distance. Left annotations show the log2-transformed expression levels of the respective genes within the different cell lines.
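The two dissimilarity measures named above, and the normalization given in the panel label, can be sketched as follows. This only illustrates the quantities involved. Using 1 − r as the Pearson-based distance, the per-STIR signal vectors as inputs, and all function names are assumptions that the legend does not confirm.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pearson_distance(x, y):
    """Correlation-based dissimilarity between experiments (assumed as 1 - r)."""
    return 1.0 - pearson(x, y)


def euclidean_distance(x, y):
    """Euclidean distance between two STIR profiles."""
    return math.dist(x, y)


def normalize_to_control(tandem_signal, control_signal):
    """Maximal tandem value divided by maximal value of the control region."""
    return max(tandem_signal) / max(control_signal)
```

The resulting pairwise distances could then be passed to any hierarchical clustering routine; the linkage method is not stated in the legend.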