Matched Annotation from NCBI and EBI (MANE)

What is MANE?

Matched Annotation from NCBI and EBI (MANE) is a collaboration between the National Center for Biotechnology Information (NCBI) and the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI). The goal of this project is to provide a minimal set of matching RefSeq and Ensembl transcripts of human protein-coding genes, where the two transcripts in a matched pair are identical (5’ UTR, coding region and 3’ UTR) but retain their respective identifiers. The MANE transcript set is classified into three groups:

  1. MANE Select: One high-quality representative transcript per protein-coding gene that is well-supported by experimental data and represents the biology of the gene.
  2. MANE Plus: A minimal set of well-supported transcripts that have additional characteristics, such as significant novel exons, which are not included in the MANE Select transcript.
  3. MANE: All other matched transcripts that are not included in the Select and Plus sets.

Rationale

Although NCBI’s RefSeq and EBI’s Ensembl-GENCODE annotations have similarities, they may differ at the transcript level. Transcripts representing a specific splice structure or coding sequence may be missing from one of the two gene sets. Additionally, transcripts representing the same splice structure may differ in the length of the untranslated regions (UTRs) or have sequence mismatches due to SNPs. Consequently, researchers who use one preferred gene set to design studies and report results may find it difficult to communicate their work to others in the scientific community. Data resources, such as genome browsers and variation databases, may also use different annotation sets to represent a default transcript, which may cause confusion. Matched MANE transcripts, which are identical in the RefSeq and the Ensembl-GENCODE annotation sets, are expected to facilitate better communication and exchange of data among the scientific community when represented across most public genomic resources. In addition, the MANE dataset represents a high-quality annotation subset backed by expert curators and the combined computational strength of the NCBI and EBI.

MANE Select

As a first step in the MANE project, in December 2018, NCBI and EBI jointly released the first version of MANE Select (MANE v0.5), which covers 53% of human protein-coding genes. This is a ‘beta’ set available for testing by RefSeq and Ensembl users. Additional transcripts will be incrementally added to this set over the next year.

MANE Select Methodology

Choosing the transcript

Initially, independent pipelines at NCBI and EBI choose the ‘select’ transcript for each gene. The ‘RefSeq Select’ pipeline is described in the RefSeq Select section. The Ensembl pipeline uses similar criteria to choose the ‘select’ transcript, albeit with slightly different implementations.

Figure 1: A flowchart showing the steps involved in the designation of a MANE Select transcript.

The transcript sets generated by the two pipelines are compared to identify matched pairs, where a match, at this point, is defined as the same splice structure and the same coding sequence (CDS). When a matching pair is not available, expert curators from the two groups examine the transcripts and create a match by 1) switching the pipeline choice of the RefSeq or the Ensembl ‘select’ to a different transcript, or 2) creating a new transcript when a matching transcript is not available in one of the annotation sets, or 3) updating the coding region of a transcript in one of the annotation sets, which is deemed wrong, to match the pipeline choice from the other annotation set.
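The initial match test described above can be illustrated with a minimal Python sketch. The Transcript class, its fields and the function names are hypothetical and are not part of either annotation pipeline. The sketch also assumes that both transcripts align to the genome without mismatches, so that identical introns and identical CDS boundaries imply an identical coding sequence.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal transcript model (not a RefSeq or Ensembl schema)."""
    chrom: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # (start, end), 1-based inclusive
    cds_start: int                      # lowest genomic coordinate of the CDS
    cds_end: int                        # highest genomic coordinate of the CDS


def introns(tx: Transcript) -> Tuple[Tuple[int, int], ...]:
    """Return the intron intervals, which define the splice structure."""
    ex = sorted(tx.exons)
    return tuple((a_end + 1, b_start - 1)
                 for (_, a_end), (b_start, _) in zip(ex, ex[1:]))


def is_initial_match(a: Transcript, b: Transcript) -> bool:
    """Same splice structure and same CDS; UTR lengths are allowed to differ."""
    return (a.chrom == b.chrom
            and a.strand == b.strand
            and introns(a) == introns(b)
            and (a.cds_start, a.cds_end) == (b.cds_start, b.cds_end))


# Made-up coordinates: the two transcripts differ only in their UTR ends.
refseq = Transcript("chr1", "+", ((100, 200), (300, 400), (500, 650)), 150, 520)
ensembl = Transcript("chr1", "+", ((120, 200), (300, 400), (500, 600)), 150, 520)
print(is_initial_match(refseq, ensembl))
```

Transcript ends are deliberately ignored at this stage, because they are reconciled in the next step.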

Matching transcript ends

Once the splice structure and the coding region are matched, the next step is to match the transcript start and end coordinates of the two transcripts in the matched pair.

Transcript start: NCBI developed a method to leverage a high-throughput sequencing technique called CAGE (cap analysis of gene expression), which specifically captures the 5’ ends of transcripts. We used CAGE data from the FANTOM consortium to determine the most likely used transcription start site (TSS). The precomputed CAGE data from the FANTOM5 dataset was reprocessed (Figure 2) to a) merge clusters that were close to each other (within 50 bases), and b) recalculate the TSS as the 5’-most base position within a cluster with a tag count that is at least 50% of that at the nucleotide position in the cluster with the maximum CAGE tag count. The goal of the reprocessing is to determine a frequently used TSS that is representative of the overall data, rather than the one with the absolute maximum tag counts.
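The two reprocessing steps can be sketched in Python as follows. This is an illustration only, not the NCBI implementation: the function names and data structures are hypothetical, "within 50 bases" is assumed to mean a gap of at most 50 bases between the end of one cluster and the start of the next, and the tag counts are made up.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) genomic coordinates, inclusive


def merge_clusters(clusters: List[Interval], max_gap: int = 50) -> List[Interval]:
    """Step a: merge clusters separated by at most max_gap bases (assumed rule)."""
    merged: List[Interval] = []
    for start, end in sorted(clusters):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def pick_tss(tag_counts: Dict[int, int], cluster: Interval, strand: str,
             fraction: float = 0.5) -> int:
    """Step b: 5'-most position whose tag count is >= fraction of the cluster maximum."""
    start, end = cluster
    in_cluster = {pos: n for pos, n in tag_counts.items() if start <= pos <= end}
    if not in_cluster:
        raise ValueError("no CAGE tags fall inside the cluster")
    threshold = fraction * max(in_cluster.values())
    candidates = [pos for pos, n in in_cluster.items() if n >= threshold]
    # The 5'-most base is the lowest coordinate on '+' and the highest on '-'.
    return min(candidates) if strand == "+" else max(candidates)


clusters = merge_clusters([(100, 110), (140, 150), (400, 410)])
counts = {102: 4, 105: 12, 108: 20, 145: 9}  # made-up tag counts per position
print(clusters)                               # [(100, 150), (400, 410)]
print(pick_tss(counts, clusters[0], "+"))     # 105
```

In this example position 108 has the maximum tag count (20), but position 105 is chosen because it is the 5’-most position that reaches half of that maximum.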

Figure 2: Determination of the 5’ end of matched transcripts (Gene MED16). This screenshot from NCBI’s Genome Data Viewer shows several useful data tracks for evaluating the transcript 5’ end. The ‘RefSeq-processed FANTOM CAGE peaks’ track (black horizontal bar) of the screenshot represents the RefSeq-processed CAGE cluster, while the green bars in the ‘FANTOM5 CAGE peaks, robust set’ track are CAGE clusters from the FANTOM5 data. The vertical red highlight marks the calculated transcription start site (TSS). The matching RefSeq and Ensembl transcripts (seen in the ‘Genes, MANE project (version 0.6)’ track) have been updated to use that TSS. The calculated TSS corresponds well with the 5’ end of the overall conventional transcript data (as seen in the INSDC transcript coverage track).

Transcript stop: The last base of the transcript is decided based on polyadenylated transcript data from conventional transcripts as well as high-throughput polyA-seq studies (PMID:30840896, PMID:30143597, PMID:29891946, PMID:29234016, PMID:26801249, PMID:26765774, PMID:25906188 and PMID:22454233). The maximum extent of the 3’ untranslated region (3’ UTR) is determined based on conventional polyadenylated transcripts, when available. As in the case of the CAGE data, polyadenylation clusters were calculated using data from multiple high-throughput polyA-seq studies. Within a cluster, the 3’-most nucleotide with a sequence read count that is at least 50% of the maximum count in the cluster is designated as the last base of the transcript (Figure 3).
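A minimal Python sketch of the 3’ end rule is shown below. It is not the NCBI implementation: the function names are hypothetical, the read counts are made up, and pooling the studies by summing read counts per position is an assumption, since the exact method of combining studies is not described here.

```python
from collections import Counter
from typing import Dict, Iterable


def pool_read_counts(studies: Iterable[Dict[int, int]]) -> Counter:
    """Sum per-position polyA read counts across studies (assumed pooling rule)."""
    pooled: Counter = Counter()
    for counts in studies:
        pooled.update(counts)  # Counter.update adds counts for matching keys
    return pooled


def pick_polya_site(read_counts: Dict[int, int], strand: str,
                    fraction: float = 0.5) -> int:
    """3'-most position whose read count is >= fraction of the cluster maximum."""
    if not read_counts:
        raise ValueError("cluster has no reads")
    threshold = fraction * max(read_counts.values())
    candidates = [pos for pos, n in read_counts.items() if n >= threshold]
    # The 3'-most base is the highest coordinate on '+' and the lowest on '-'.
    return max(candidates) if strand == "+" else min(candidates)


# Made-up read counts for one polyA cluster, taken from three studies.
studies = [
    {1004: 10, 1007: 12},
    {1004: 8, 1007: 18, 1010: 16},
    {1000: 3, 1013: 5},
]
cluster_counts = pool_read_counts(studies)
print(pick_polya_site(cluster_counts, "+"))  # 1010
```

Here the pooled maximum is 30 reads at position 1007, and position 1010 (16 reads) is the 3’-most position that reaches half of that maximum on the plus strand.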

Figure 3: Determination of transcript end (Gene NDUFS7). This screenshot from NCBI’s Genome Data Viewer shows several useful data tracks for evaluating the transcript 3’ end. The upper data tracks show the varying ends of transcripts in the RefSeq and Ensembl annotation sets. The ‘polyA sites and clusters’ track shows the polyadenylation (polyA) cluster (red horizontal bar) computed from multiple polyA-seq studies. Each polyA cluster is associated with a polyA signal feature (horizontal green bar). Within the polyA cluster, the polyA site (dark filled rectangle below the polyA cluster) represents the transcript end. The computed polyA site (green vertical highlight) corresponds with the most frequently used polyA site in conventional transcript data (transcript polyadenylated termini track at the bottom) as well as the end of the transcript coverage graph (seen in the INSDC transcript coverage track).

Salient features of MANE Select transcripts

  1. The MANE Select transcript for a human protein-coding gene consists of a pair of identically annotated transcripts, the RefSeq transcript (with an NM_ identifier) and the Ensembl transcript (with an ENST identifier). The two transcripts in the pair have identical sequence and splice structure and the same start and end coordinates.
  2. The MANE Select set includes only curated transcripts from the RefSeq and the Ensembl-GENCODE annotation sets.
  3. MANE Select transcripts match the GRCh38 human reference genome assembly.
  4. Changes to MANE Select transcripts, including sequence changes and/or transcript identifier changes, may occur, but our goal is to stabilize the set and only make changes for compelling reasons.

Manual curation of MANE data

While most of the MANE Select transcripts are chosen computationally, there are cases where the pipeline is unable to choose a suitable transcript for a variety of reasons (for example, lack of data or insufficient data to make an unequivocal choice). Such cases are reviewed by expert curators from the RefSeq group and the EBI (the GENCODE and the LRG curation groups) to choose the MANE Select transcript. Additionally, curators play a crucial role in maintaining the quality of the MANE data by reviewing MANE Select transcripts flagged by a battery of QA tests.

Accessing MANE Select data

Currently, the MANE Select data can be accessed in the following ways:

  1. Bulk download via FTP: Separate files are provided in GFF3, GTF and FASTA formats for both the RefSeq and Ensembl identifiers, and additionally in GenBank flatfile format for the RefSeq transcripts and proteins. Further information is available in this README file.

  2. NCBI Entrez search: The ‘MANE Select’ keyword included in RefSeq flat files (see RefSeq Select) can be used in Nucleotide and Protein database queries. For example: PALM[gene] AND MANE Select[keyword]. The entire list of MANE Select transcripts can be obtained using the Entrez query “Homo sapiens[organism] AND MANE_select[keyword]”. The list can then be downloaded and saved to a file using the “Send to” tab at the top of the search results page.

  3. RefSeq annotation files available via FTP: Column 9 of the GFF3 and the GTF files contains a “MANE Select” tag attribute (tag=MANE Select in GFF3, or tag “MANE Select” in GTF) in the rows associated with the mRNA, CDS and exon features. In addition, column 9 contains the matching Ensembl transcript identifier as an external database reference (Dbxref). Rows in the annotation files associated with the CDS feature contain the MANE Select tag, along with the matching Ensembl protein identifier.

  4. NCBI’s Genome Data Viewer (GDV) or Variation Viewer browsers: The ‘Genes, MANE project (release 0.6)’ track (Figure 4) provides a graphical representation of the MANE Select transcript pair.

  5. Track hub: A track hub of the MANE Select data is available here. The track hub can be used to visualize the MANE Select data in genome browsers such as the UCSC Genome Browser (Figure 5) and the Ensembl Browser (Figure 6).

  6. MANE track in the UCSC Genome Browser: The MANE Select v0.5 dataset is also available as a native track in the UCSC Genome Browser (GRCh38/hg38 assembly only; Figure 7).
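For the RefSeq GFF3 annotation files described in item 3 above, the MANE Select pairs can be pulled out of column 9 with a few lines of Python. The sketch below is a minimal example under stated assumptions: the function names are hypothetical, the transcript accession is assumed to be in a Name attribute, and the Ensembl cross-reference is assumed to appear in Dbxref with an "Ensembl:" prefix. The sample row is hand-written for illustration (its coordinates are placeholders) and should be checked against the actual files.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple
from urllib.parse import unquote


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split GFF3 column 9 ('key=value;key=value') into a dictionary."""
    attrs: Dict[str, str] = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = unquote(value)  # GFF3 values may be percent-encoded
    return attrs


def mane_select_pairs(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], Optional[str]]]:
    """Yield (RefSeq transcript, Ensembl transcript) for mRNA rows tagged MANE Select."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "MANE Select" not in attrs.get("tag", "").split(","):
            continue
        # Dbxref is a comma-separated list of 'database:identifier' entries.
        xrefs = dict(x.split(":", 1) for x in attrs.get("Dbxref", "").split(",") if ":" in x)
        yield attrs.get("Name") or attrs.get("ID"), xrefs.get("Ensembl")


# Hand-written sample row; coordinates and strand are placeholders.
sample = ("NC_000001.11\tBestRefSeq\tmRNA\t1000\t2000\t.\t+\t.\t"
          "ID=rna-NM_004401.3;Dbxref=Ensembl:ENST00000377038.8,GenBank:NM_004401.3;"
          "Name=NM_004401.3;gene=DFFA;tag=MANE Select")
print(list(mane_select_pairs([sample])))
```

To process a downloaded file, an open file handle can be passed in place of the list, since the function only iterates over lines.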



Figure 4: A view of the gene PALM in Genome Data Viewer, showing the 'Genes, MANE project (release v0.5)' track. The track includes the transcript and protein identifiers of the MANE Select pair.



Figure 5: The MANE v0.6 track hub loaded into the UCSC Genome Browser (GRCh38/hg38 assembly) showing the MANE Select transcripts for the gene DFFA, NM_004401.3 and ENST00000377038.8, in blue and red, respectively.



Figure 6: The MANE v0.5 track hub loaded into the Ensembl Genome Browser showing the MANE Select transcripts for the gene RAB7B, NM_001164522.3 and ENST00000617070.5 in the track titled 'MANE select v0.5', in blue and red, respectively.



Figure 7: The MANE Select track (NCBI RefSeq and Ensembl transcripts from the MANE project v0.5) in the UCSC Genome Browser (GRCh38/hg38 assembly, gene CTTN) at the bottom of the browser view. The gene name, Ensembl MANE Select transcript and protein identifiers and the NCBI MANE Select transcript and protein identifiers are visible when the cursor hovers over the track identifiers in the left panel (and the options for these identifiers are checked in the track settings).

Note: The MANE Select set will be updated to add transcripts until near-complete genomic coverage is achieved. Therefore, the MANE Select version mentioned in this document may not be the most recent one that is available in NCBI resources.

Contact information

Questions, suggestions and comments on the MANE project may be sent to one of the following email addresses:

Last updated: 2019-08-02T20:39:59Z