• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of bioinfoLink to Publisher's site
Bioinformatics. Oct 1, 2008; 24(19): 2272–2273.
Published online Aug 12, 2008. doi:  10.1093/bioinformatics/btn424
PMCID: PMC2553440

The Synergizer service for translating gene, protein and other biological identifiers

Abstract

Summary: The Synergizer is a database and web service that provides translations of biological database identifiers. It is accessible both programmatically and interactively.

Availability: The Synergizer is freely available to all users inter-actively via a web application (http://llama.med.harvard.edu/synergizer/translate) and programmatically via a web service. Clients implementing the Synergizer application programming interface (API) are also freely available. Please visit http://llama.med.harvard.edu/synergizer/doc for details.

Contact: ude.dravrah.smh@htor_ztirf

With the wealth of information available in biological databases has come a proliferation of ‘namespaces’, i.e. schemes for naming biological entities (genes, proteins, metabolites, etc.). For example, a single gene might be identified as ‘IL1RL1’ in the HGNC symbol namespace, ‘ENSG00000115602’ in the Ensembl gene id namespace, and ‘Hs.66’ in the Unigene namespace, while a protein product of that gene might be identified as ‘NP_003847’ in the RefSeq peptide namespace and ‘IPI00218676’ within the International Protein Index (IPI) namespace.

The lack of standardized gene and protein identifiers remains a fundamental hindrance to biological research, and is particularly obstructive to strategies based on integrating high-throughput data from disparate sources (e.g. combining mRNA expression data with protein interaction and functional annotations).

A very common task is the translation of an ordered set of identifiers from one namespace to another. The Synergizer is a database, associated with both programmatic and interactive web interfaces, with the sole purpose of helping researchers (both bench scientists and bioinformaticians) perform this deceptively simple task.

The simplest way to describe the use of the programmatic interface is via a short example Perl script (see Fig. 1) using the Perl module Synergizer::DemoClient (available for download; see Availability above).

Fig. 1.
Use of a typical Synergizer client.

The key functionality of the Synergizer application programming interface (API) is represented here by the function translate (line 11). When executed, this function generates a remote procedure call in the form of a JSON-encoded object, and sends it via HTTP to a remote Synergizer server. This server translates the identifiers in the ‘ids’ argument (lines 7–9) from one namespace (here designated as the ‘domain’) to another (designated the ‘range’), using mappings provided by the specified ‘authority’ [in this case Ensembl (Hubbard et al., 2002)]. These results are returned via HTTP to the script, where they are assigned to the variable $translated as a reference to an array of arrays, one array per original input identifier (since an input identifier may return zero, one or several translations). Some identifiers (e.g. ‘pxn’ in the example) that belong to the domain namespace, but for which no equivalent in the range namespace was found, will return no translations. To highlight inputs that were not found in the domain namespace (e.g. ‘?test?’), these identifiers are translated to the undefined value. For further details, please consult the Synergizer API (see Availability above).

It is important to note that although the example above is written in Perl, the Synergizer service is language independent (as well as platform independent). The API for the service is publicly available and it is a simple matter to write API-conforming clients in Perl, Python, PHP, Ruby, Java, JavaScript or any other modern programming language.

A second illustration of the service is its web front end, (see Availability above), which is itself a Synergizer client application (illustrating the language independence of the Synergizer API, this client is written in JavaScript as opposed to the earlier example in Perl).

Although several tools are available to translate biological identifiers for example, see references Bussey et al., 2003; Côté 2007; Draghici et al., 2006; Huang et al., 2007; Iragne et al., 2004; Kasprzyk et al., 2004; Reimand et al., 2007), the Synergizer has some features that will make it particularly useful to bioinformaticians.

Perhaps the Synergizer's greatest asset is its simplicity. It is designed to perform a single task, bulk translation of biological database identifiers from one naming scheme (or namespace) to another, as quickly and simply as possible. The service obtains its information from authorities, such as Ensembl (Hubbard et al., 2002) and NCBI (Wheeler et al., 2008), that publish detailed correspondences between their identifiers and those used by external databases. In general, we say that two identifiers are ‘synonymous’, according to a specific authority, when the authority assigns them to its same internal identifier. (For brevity, we refer to the authority's internal identifier as the ‘peg’.) For example, we would say that, according to authority Ensembl, the identifiers IL1RL1 and Q01638 are synonymous because it assigns both to its internal identifier (peg) ENSG00000141510. For the same reason, 6 01 203 and 5998 are synonymous, but in this second example the synonyms are simple numbers that give no indication of the database of origin. For this reason, every identifier listed by the Synergizer service belongs to a specific ‘namespace’. In this example, 6 01 203 belongs to the namespace mim_gene_accession, and 5998 belongs to the namespace hgnc_id. More formally, ‘namespace’, as we use the term, is a collection of identifiers, all generated by the same organization, in a controlled way.

Regarding namespaces, it is worth noting that some providers of biological information follow the practice of prepending a prefix to the identifier to indicate the database of origin (e.g. HGNC:IL1RL1), which has the same effect of segregating identifiers according to namespace. The Synergizer system generally does not follow this practice.

Also, the identifiers discussed until now, which belong to well-defined namespaces, must be distinguished from those that are proposed ad hoc, one-at-a-time, by the researchers who first describe them in the literature. These ad hoc identifiers are only within the scope of the Synergizer service where they correspond to a tightly controlled namespace for which some authority offers correspondences to other namespaces.

For each authority, Synergizer uses a peg that is specific to that authority. For example, for Ensembl currently it is the Ensembl gene id, and for NCBI it is the Entrez gene id. But the choice of peg for a given authority is an implementation detail that may change in the future.

By this procedure, we generate a repository of synonym relationships between database identifiers. When we do this we often find discrepancies among various authorities. The reasons for these discrepancies are varied. They range from simple time lags between databases, to policy differences among the authorities on the assignment of external identifiers to their respective internal identifiers, and even to more substantive disagreements, at the scientific level, on gene assignments. Rather than attempting to resolve these discrepancies, the synonym relationships served by each authority are kept separate within the Synergizer system.

The Synergizer's schema has been designed to preserve the provenance of all synonym relationships, and to accommodate new sources of synonym information over time.

To access the Synergizer's interactive web interface visit the link listed under Availability above. To use the interface, simply paste the identifiers to be translated in the input field (or, alternatively, enter the name of a local file from which to upload the identifiers). Then, choose the domain and range namespaces. It is also possible to specify the special catchall domain namespace ‘__ANY__’ (although we note that specifying the domain namespace recommended where possible since it is less prone to ambiguous translation). By default the output is in the form of an HTML table, but there is also the option to obtain the output in the form of a spreadsheet.

Currently the Synergizer supports synonyms from two different authorities Ensembl and NCBI, and holds a total of just over 20 million synonym relations covering over 70 species and over 150 namespaces.

ACKNOWLEDGEMENTS

We thank J. Beaver, E. Birney, C. Bult, R. Gerszten, A. Kasprzyk, D. Maglott, J. Mellor, T. Shtatland and M. Tasan for helpful discussions, and technical and editorial advice.

Funding: National Institutes of Health. (grants HG003224, HG0017115, and HL081341), Keck Foundation.

Conflict of Interest: none declared.

REFERENCES

  • Bussey KJ, et al. MatchMiner: a tool for batch navigation among gene and gene product identifiers. Genome Res. 2003;4:1–7. [PMC free article] [PubMed]
  • Côté R, et al. The Protein Identifier Cross-Reference (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics. 2007;8:401. [PMC free article] [PubMed]
  • Draghici S, et al. Babel's tower revisited: a universal resource for cross-referencing across annotation databases. Bioinformatics. 2006;22:2934–2939. [PMC free article] [PubMed]
  • Huang H, et al. Integration of bioinformatics resources for functional analysis of gene expression and proteomic data. Frontiers in Bioscience. 2007;12:5071–5088. [PubMed]
  • Hubbard T, et al. The Ensembl genome database project. Nucleic Acids Res. 2002;30:38–41. [PMC free article] [PubMed]
  • Iragne F, et al. Aliasserver: a web server to handle multiple aliases used to refer to proteins. Bioinformatics. 2004;20:2331–2332. [PubMed]
  • Kasprzyk A, et al. Ensmart: A generic system for fast and flexible access to biological data. Genome Res. 2004;14:160–169. [PMC free article] [PubMed]
  • Reimand J, et al. g:Profiler–a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 2007;35(suppl. 2):W193–W200. [PMC free article] [PubMed]
  • Wheeler DL, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008;36(suppl. 1):D13–D21. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Links

  • MedGen
    MedGen
    Related information in MedGen
  • Protein
    Protein
    Published protein sequences
  • PubMed
    PubMed
    PubMed citations for these articles
  • Substance
    Substance
    PubChem Substance links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...