GEO Accession viewer

Sample GSM1440666

Status

Public on Dec 04, 2014

Title

TAIL-seq exp #2 Mock

Sample type

SRA

Source name

TS2_Mock

Organism

Homo sapiens

Characteristics

cell line: HeLa
cell type: human cervical cancer cell line

Treatment protocol

HeLa cells were transfected twice with 40 nM of siRNAs using Lipofectamine 2000 (Invitrogen). A mixture of 2-4 different siRNAs was used to knock down each gene. For combinatorial knockdowns, the siRNA pools were mixed to a final concentration of 40 nM.

Growth protocol

HeLa cells were maintained in DMEM (Welgene) supplemented with 10% fetal bovine serum (Welgene).

Extracted molecule

total RNA

Extraction protocol

Total RNA was extracted from HeLa cells with TRIzol reagent (Invitrogen, 15596-018) and purified on an RNeasy MinElute column (Qiagen, 74204) according to the manufacturer’s instructions.
TAIL-seq was carried out as described previously (Chang et al., 2014b). Briefly, 25-50 µg of total RNA (>200 nt) was extracted with TRIzol (Invitrogen), purified on an RNeasy MinElute column (Qiagen), and rRNA-depleted using the Ribo-zero kit (Epicentre). The RNAs were ligated to the biotinylated 3′ adaptor and partially digested with RNase T1. The fragmented RNAs were pulled down with streptavidin beads, phosphorylated at the 5′ end, and gel-purified (500-1000 nt). The purified RNAs were ligated to the 5′ adaptor, reverse-transcribed, and amplified by PCR.

Library strategy

OTHER

Library source

transcriptomic

Library selection

other

Instrument model

Illumina MiSeq

Description

TAIL-seq of mock-treated HeLa cells, set #2

Data processing

Library strategy: TAIL-Seq
The base calls and signal intensities were processed by Illumina RTA 1.17.21.3 for HiSeq 2500, or RTA 1.17.28 for MiSeq. The read 1 sequences were reanalyzed with AYB 2 for more sensitive base calling. The read 1 sequences were aligned, using GSNAP 2013-03-31 with a maximum of 5% mismatches allowed, to a set of common contaminants composed of rDNA repeat units (GenBank accession U13369.1), the PhiX genome (GenBank accession J02482.1), Illumina TruSeq primer sequences, and all 5S and 5.8S rRNA sequences of the respective species (retrieved from Rfam 11.0 of the Wellcome Trust Sanger Institute). Clusters with any match to the contaminants were removed from subsequent analyses. Clusters with completely identical nucleotides in the 21st to 35th cycles of read 1 (a representative region of the insert) and the 1st to 15th cycles of read 2 (the degenerate bases in the 3′ adapter) were deduplicated by keeping only the cluster with the maximum sum of PHRED quality scores in read 1. The degenerate bases and the fixed delimiter sequence of the 3′ adapter were clipped from read 2 by searching for a perfect match to the delimiter sequence (‘GTCAG’ in the direction of read 2) between the 14th and 16th cycles of read 2. Clusters missing the delimiter sequence or having low diversity in the degenerate region (at least two occurrences for all of A, C, G and T) were removed from further analyses.
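The deduplication and delimiter-clipping steps above can be sketched as follows. This is a minimal illustration, not the submitters' code: the function names and the dict layout of a cluster are hypothetical, and it assumes that "between the 14th and 16th cycles" means the delimiter may start at cycle 14, 15 or 16 (1-based).

```python
DELIMITER = "GTCAG"  # delimiter sequence as read in the direction of read 2


def clip_delimiter(read2, first_cycle=14, last_cycle=16):
    """Split read 2 into (degenerate_bases, remainder) at the delimiter.

    Assumption: the delimiter may START at any cycle from first_cycle to
    last_cycle (1-based, inclusive). Returns None when no perfect match
    is found, so the caller can discard the cluster.
    """
    for cycle in range(first_cycle, last_cycle + 1):
        start = cycle - 1  # convert 1-based cycle to 0-based index
        if read2[start:start + len(DELIMITER)] == DELIMITER:
            return read2[:start], read2[start + len(DELIMITER):]
    return None


def deduplicate(clusters):
    """Keep one cluster per (read 1 cycles 21-35, read 2 cycles 1-15) key.

    Each cluster is a hypothetical dict with 'read1', 'read2' and 'qual1'
    (a list of PHRED scores for read 1). Among clusters sharing a key,
    the one with the largest PHRED sum in read 1 is kept.
    """
    best = {}
    for cluster in clusters:
        key = (cluster["read1"][20:35], cluster["read2"][:15])
        if key not in best or sum(cluster["qual1"]) > sum(best[key]["qual1"]):
            best[key] = cluster
    return list(best.values())
```

A cluster for which `clip_delimiter` returns `None` corresponds to a cluster "missing the delimiter sequence" in the text; the low-diversity filter is not reproduced here.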
The fluorescence signal intensities were converted into “Relative T signal” as described in our previous paper (Chang et al., 2014b). The signals from a spike-in sample were cleaned with an outlier filter based on robust Mahalanobis distance (mvoutlier package 1.9.9; quan=0.5, alpha=0.025). For each spike-in, 500 randomly chosen clusters were used to estimate the parameters of a Gaussian mixture hidden Markov model (GMHMM). We trained the model with the Baum-Welch algorithm implemented in the GHMM library (http://ghmm.org), using the topology and initial parameters shown in fig. S1A and tables S7 and S8 (1,000 iterations). The procedure was iterated to maximize likelihood without using any property of the spike-ins (e.g., the designed length of the poly(A) tail). Relative T signals outside the range [-5, 5] were clipped into that range for both training and later calculations.
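The final clipping step can be illustrated with a short sketch. The function name is hypothetical, and the computation of the Relative T signal itself (described in Chang et al., 2014b) is not reproduced.

```python
def clip_signal(values, low=-5.0, high=5.0):
    """Clamp each Relative T signal value into the closed range [low, high]."""
    return [min(high, max(low, v)) for v in values]
```

Values already inside the range pass through unchanged, so the same function can be applied before training and before decoding.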
The lengths of poly(A) tails were first measured with the base call-based “Strategy II” described in fig. S1A. For clusters whose measured length was shorter than 8 nt, that length was taken as the final poly(A) tail length. For the others, normalized T signals starting from the first position of the T-stretch detected by Strategy II were analyzed with the GMHMM. The hidden states were decoded with the standard Viterbi algorithm implemented in the GHMM library. The number of cycles in state 1 or 2 was taken as the poly(A) tail length. To estimate performance, we applied the process to all spike-in samples, excluding the clusters used for fitting the model parameters.
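The two-tier length call described above can be sketched as a small decision rule. This is an illustration under assumptions: the Viterbi decoding is taken as already done (the GHMM library is not called here), states are represented as integers, and the function and constant names are hypothetical.

```python
POLY_A_STATES = {1, 2}   # states counted as poly(A), per the description
SHORT_TAIL_CUTOFF = 8    # nt; shorter base-call lengths are used directly


def call_tail_length(basecall_length, decoded_states):
    """Return the final poly(A) tail length for one cluster.

    basecall_length: length measured by the base call-based strategy.
    decoded_states: per-cycle hidden states from Viterbi decoding
        (assumed to be integers; only consulted for longer tails).
    """
    if basecall_length < SHORT_TAIL_CUTOFF:
        return basecall_length
    return sum(1 for state in decoded_states if state in POLY_A_STATES)
```

For example, a cluster with a base-call length of 12 and a decoded path containing five cycles in state 1 or 2 would be assigned a tail length of 5.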
The reads remaining after the contaminant filter and the first duplicate filter were then aligned to the genome sequence (UCSC hg19; positions of splice junctions were taken from the UCSC Genome Browser database, version of Jan 24, 2013) using GSNAP 2013-03-31. Three different alignments to the genome were used in this study. (1) R1 alignment: using only the full read 1 sequences, which are 51 nt long. This was used for identification of a cluster. (2) R2 short alignment: using only the 40 nt immediately adjacent to the 3′ adapter in read 2. This was used in searching for poly(A)-free 3′ hydroxyl ends. (3) Paired alignment: using the full read 1 sequences and the part of the read 2 sequences left after trimming the degenerate bases and delimiter. We used this alignment set to filter out genome-encoded poly(A) stretches. All alignments were performed with a maximum of 5% mismatches and a minimum mapping quality of 3. All multi-mapped reads were removed. The remaining PCR artifacts with a few mismatches were removed using the R1 alignment together with the 15 degenerate bases inside the 3′ adapter region. To detect such artifacts, we first clustered the R1 alignments with a maximum distance of 10 bp between mapped positions, and then, within each of these clusters, clustered the reads again by the degenerate bases from read 2 of the respective reads with CD-HIT-EST 4.5.4 (word size=6, sequence identity=0.85). For each set of detected duplicates, we kept the read with the maximum sum of PHRED quality scores in read 1.
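The first stage of the PCR-artifact detection, grouping R1 alignments by mapped position, might look like the sketch below. It assumes that "maximum distance of 10 bp" refers to the gap between neighbouring positions after sorting (single-linkage grouping) and that all positions come from the same chromosome and strand; the second stage with CD-HIT-EST is not reproduced. The function name is hypothetical.

```python
def cluster_by_position(positions, max_gap=10):
    """Group mapped positions so that neighbours within a group are
    at most max_gap bp apart (single-linkage on sorted positions)."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)   # close enough: extend current group
        else:
            clusters.append([pos])     # gap too large: start a new group
    return clusters
```

Each resulting group would then be subdivided by similarity of the degenerate bases, and one read per duplicate set kept by PHRED sum.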
For classification and transcript-level analyses, we compiled reference annotations for human and mouse from the NCBI RefSeq, RepeatMasker, gtRNAdb, Rfam and miRBase databases (the first three were downloaded from the UCSC Genome Browser on Apr 25, 2013; Rfam version 11; miRBase version 19). The R1 alignments were annotated by intersecting them with the compiled annotations using BEDTools (Quinlan, 2010). When multiple annotations overlapped an alignment, we chose a single class, for the statistics that require exclusive assignment of a genomic source type, by the following priority: miRNA, rRNA, tRNA, Mt-tRNA, snoRNA, scRNA, srpRNA, snRNA, lncRNA, RNA, ncRNA, misc_RNA, Cis-reg, ribozyme, RC, IRES, frameshift_element, LINE, SINE, Simple_repeat, Low_complexity, Satellite, DNA, LTR, CDS, 3′ UTR, 5′ UTR, intron, Other, Unknown (higher priority first). The transcript-level analyses were performed using our custom non-redundant RefSeq (nrRefSeq) transcript set, a reduced set that retains only the longest isoform or transcript when regions overlap with each other. The positions of read 1 in nrRefSeq transcripts were determined by a BEDTools intersection between the genome alignments and the nrRefSeq annotation set, and then translated to transcript-level coordinates with in-house software.
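The priority rule for exclusive class assignment can be sketched as a simple lookup. The priority list is copied from the text; the function name, the handling of an empty overlap set (returned as "Unknown") and the handling of names outside the list (ranked last) are assumptions made for illustration.

```python
# Priority order from the description, highest priority first.
PRIORITY = [
    "miRNA", "rRNA", "tRNA", "Mt-tRNA", "snoRNA", "scRNA", "srpRNA",
    "snRNA", "lncRNA", "RNA", "ncRNA", "misc_RNA", "Cis-reg", "ribozyme",
    "RC", "IRES", "frameshift_element", "LINE", "SINE", "Simple_repeat",
    "Low_complexity", "Satellite", "DNA", "LTR", "CDS", "3′ UTR", "5′ UTR",
    "intron", "Other", "Unknown",
]
RANK = {name: index for index, name in enumerate(PRIORITY)}


def pick_class(overlapping):
    """Choose the single highest-priority class among overlapping annotations.

    Assumptions: an alignment with no overlapping annotation is reported
    as 'Unknown', and class names not in PRIORITY rank after all others.
    """
    if not overlapping:
        return "Unknown"
    return min(overlapping, key=lambda name: RANK.get(name, len(PRIORITY)))
```

With this ordering, an alignment overlapping both an intron and a LINE element is counted as LINE.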
Because poly(A) tails were initially detected under the constraint that they must begin within the first 30 cycles, the maximum detectable 3′ end modification of poly(A) tails was limited to the last 30 nucleotides of the insert. To exclude A stretches obviously encoded in the genomic sequence (with or without 3′ end modifications), we masked the detected poly(A) tail ranges with the read 2 alignments so that the 3′-most alignable (not clipped) position was excluded from the poly(A) tail or its 3′ end modifications. All statistics regarding transcript-level modification rates were calculated for transcripts having more than 200 tags with poly(A) tails longer than 8 nt.
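The transcript-level inclusion criterion can be sketched as below. This reading is an assumption: both thresholds are treated as strict ("more than 200", "longer than 8 nt"), and only tags with tails longer than 8 nt count toward the 200. The function name and the `(transcript_id, tail_length)` tuple layout are hypothetical.

```python
from collections import Counter


def eligible_transcripts(tags, min_tail=8, min_tags=200):
    """Return the set of transcripts with more than min_tags tags whose
    poly(A) tail is longer than min_tail nt.

    tags: iterable of (transcript_id, tail_length) pairs.
    """
    counts = Counter(tx for tx, tail in tags if tail > min_tail)
    return {tx for tx, n in counts.items() if n > min_tags}
```

Modification rates would then be computed only over the transcripts in the returned set.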
Genome_build: hg19
Supplementary_files_format_and_content: The spreadsheet files contain the poly(A) tail length distribution and 3' end modification frequencies next to poly(A) tails for all detected transcripts

Submission date

Jul 21, 2014

Last update date

May 15, 2019

Contact name

Hyeshik Chang

E-mail(s)

hyeshik@snu.ac.kr

Organization name

Seoul National University

Department

School of Biological Sciences

Lab

Hyeshik Chang Lab

Street address

Building 203 Room 525, School of Biological Sciences, Seoul National University, 1 Gwanak-ro, Gwanak-gu

City

Seoul

State/province

South Korea

ZIP/Postal code

08826

Country

South Korea

Platform ID

GPL15520

Series (2)

GSE59627	Uridylation by TUT4 and TUT7 marks mRNA for degradation [TAIL-Seq]
GSE59628	Uridylation by TUT4 and TUT7 marks mRNA for degradation

Relations

BioSample

SAMN02928635

SRA

SRX658157

Supplementary file	Size	File type/resource
GSM1440666_TS2_Mock-tails.txt.gz	5.9 Mb	TXT
Raw data are available in SRA
Processed data provided as supplementary file