Brief Bioinform. Jan 2009; 10(1): 35–52.
Published online Jan 29, 2009. doi:  10.1093/bib/bbn047
PMCID: PMC2638621

Next generation tools for the annotation of human SNPs

Abstract

Computational biology has the opportunity to play an important role in the identification of functional single nucleotide polymorphisms (SNPs) discovered in large-scale genotyping studies, ultimately yielding new drug targets and biomarkers. The medical genetics and molecular biology communities are increasingly turning to computational biology methods to prioritize interesting SNPs found in linkage and association studies. Many such methods are now available through web interfaces, but the interested user is confronted with an array of predictive results that are often in disagreement with each other. Many tools today produce results that are difficult to understand without bioinformatics expertise, are biased towards non-synonymous SNPs, and do not necessarily reflect up-to-date versions of their source bioinformatics resources, such as public SNP repositories. Here, I assess the utility of the current generation of webservers and suggest improvements for the next generation of webservers to better deliver value to medical geneticists and molecular biologists.

Keywords: SNP, bioinformatics, prediction methods, webservers, review

INTRODUCTION

The rapid growth of genomic tools such as single nucleotide polymorphism (SNP) allele genotyping arrays and next-generation DNA sequencing has produced unprecedented amounts of information about the genotypes of individuals in many species. Yet even when association or linkage studies detect statistically significant correlations between a genomic region and a phenotype, the identity of the causative polymorphism often remains unknown. Tracking down functional SNPs is one of the key challenges of modern genetics, and a new branch of computational biology has emerged to support this effort.

The first computational methods designed to predict the biological impact of SNPs appeared almost a decade ago [1–5]. In subsequent years, a variety of methods have been introduced, reviewed in [6–9], and many now provide websites that take SNPs of interest as input and return annotations, including classifications of biological importance [10–18]. Medical genetics and molecular biology researchers are increasingly turning to these methods and websites as an inexpensive way to prioritize SNPs of interest, prior to functional tests [19–29], and even to select tag SNPs for linkage and association studies [30–33]. These methods incorporate material from computer science, applied mathematics and population genetics, including machine learning, probabilistic modeling, statistics, software engineering and phylogeny. To make technical material accessible, specialized terms have been italicized and are defined in a glossary (Table 1).

Table 1:
Glossary of technical terms

The SNP function prediction community currently lacks a gold standard. Available methods have been trained and benchmarked on many different data sets (Table 2), and many methods are applicable to only a subset of all SNPs, such as non-synonymous (amino-acid changing) SNPs, or non-synonymous SNPs that can be mapped onto protein structures. Fair assessment of which methods are best is beyond the scope of this review. Instead, I present a survey of available services, discuss trends in the field, and highlight strengths and weaknesses that may be of interest to a potential user of SNP function prediction webservers.

Table 2:
Computational biology SNP prediction webservers fall into three categories

SNP webservers: strategies and communities

Today's SNP prediction servers generally use one of three strategies (Table 2):

  1. methods servers that disseminate results of original computational method(s);
  2. metaservers that pull information from many servers, including general purpose protein and genomic annotation bioinformatics servers; and
  3. hybrids that both disseminate original method(s) and pull information from other servers.

All of these servers are built on top of an infrastructure of general bioinformatics resources that curate SNPs, genomic and protein sequences, protein structures, interactions, pathways and regulatory elements (such as sites important for transcription factor binding and accurate splicing). The relationships among SNP webservers and other bioinformatics resources can be represented as a directed graph (Figure 1). Partitioning the graph with an algorithm based on local modularity [34] yields three main communities, which can be loosely defined as: the protein community (ellipse), connected to the large, core bioinformatics databases UniProt [35], Protein Data Bank (PDB) [36], Structural Classification of Proteins (SCOP) [37], Biomolecular Interaction Network Database (BIND) [38], Molecular Interactions Database (MINT) [39], Gene Ontology (GO) [40], Kyoto Encyclopedia of Genes and Genomes (KEGG) [41] and BioCarta (http://www.biocarta.com); the regulatory community (trapezoid) connected to webservers that predict post-translational modifications, splicing enhancers and repressors and transcription factor binding sites (TFBSs); and a regulatory plus linkage disequilibrium community (rectangle), which is connected to the HapMap webserver (http://www.hapmap.org). Core resources such as National Center for Biotechnology Information (NCBI) databases (http://www.ncbi.nlm.nih.gov), UCSC Genome Browser [42] (http://genome.ucsc.edu) and Ensembl [43] (http://www.ensembl.org) are used by all three communities. A singleton server (white rectangle), the SNP Function Portal [14], is connected to both protein and linkage disequilibrium communities, and perhaps represents an emergent fourth community. Protein community webservers primarily predict the biological importance of non-synonymous SNPs, using properties such as evolutionary conservation of amino acid sequence, protein structure and protein binding interactions. 
These properties are often combined in ‘black box’ machine learning algorithms—neural networks, support vector machines and random forests—yielding predictions that are difficult to understand from a biological point of view. The regulatory community primarily harvests predictions from external servers that specialize in identification of regulatory motifs. Although these methods were not designed specifically for SNPs, they can be used, at least in theory, to predict the effect of the SNP on normal patterns of regulation. The third community contains websites connected to resources that provide information about genomic linkage disequilibrium structure.
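To make the idea of partitioning a server graph into communities more concrete, the following minimal sketch scores a candidate partition with Newman's global modularity Q. This is a stand-in chosen for brevity and is not the local-modularity algorithm of [34]. The graph is treated as undirected and unweighted, and all node names are hypothetical placeholders rather than the servers and resources shown in Figure 1.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m  -  (d_c / (2m))**2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the summed degree of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the same community
    degree_sum = defaultdict(int)  # total degree of the nodes in each community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (hypothetical names): two tightly knit groups joined by one link.
edges = [
    ("server_a", "server_b"), ("server_b", "resource_x"), ("server_a", "resource_x"),
    ("server_c", "server_d"), ("server_d", "resource_y"), ("server_c", "resource_y"),
    ("resource_x", "resource_y"),
]
two_groups = {
    "server_a": 0, "server_b": 0, "resource_x": 0,
    "server_c": 1, "server_d": 1, "resource_y": 1,
}
one_group = {node: 0 for node in two_groups}

q_two = modularity(edges, two_groups)  # partition that follows the two groups
q_one = modularity(edges, one_group)   # everything lumped together
print(q_two > q_one)
```

A partition that follows the densely connected groups scores higher than one that lumps every node together, which is the intuition behind describing the servers as communities.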

Figure 1:
Directed graph of relationships among SNP prediction webservers and their bioinformatics sources. A heuristic partition of the graph identifies three communities. They are loosely defined as (1) focus on protein properties (ellipse); (2) focus on regulation ...

The ‘protein community’ is the largest and the oldest. But the general landscape is shifting towards inclusion of regulatory SNPs and consideration of inter-SNP associations through linkage disequilibrium (Figure 2a). The landscape may also be shifting away from methods servers towards meta-servers and hybrids (Figure 2b).

Figure 2:
Trends in scope of SNP webservers. (a) Prior to 2006, protein-based servers that only handle non-synonymous SNPs were predominant. Newer servers include regulatory SNPs and annotate associations among SNPs through linkage disequilibrium estimates. (b) ...

The webserver graph (Figure 1) shows that there is not much feedback to the servers from their sources, although this may change with time. There is currently one exception—a feedback loop connecting two SNP servers in the regulatory/linkage disequilibrium community—SNPeffect [13] and PupaSuite [12]. These servers are synchronized and describe their relationship as a joint effort to cover both protein and regulatory related SNPs. Such relationships may become more common in the next generation.

FIELD TESTING OF CURRENT WEB SERVERS

To assess their usability and scientific utility, I evaluated 22 servers by submitting to each a set of SNPs that were reported to be associated with disease in recently published medical literature. All submissions were done using Firefox 2.0.0.13 on Windows XP Professional Edition. The field tests were done during the week of 28 April 2008. One server (Pmut [44]) returned no results, and inquiry emails went unanswered, so it was eliminated from the assessment. Detailed descriptions of all field tests are provided (Supplementary Tables S2, S3 and S4) with the main results summarized in this section. Since all of these websites were designed by bioinformaticians, it is not surprising that all of them require some bioinformatics expertise on the part of the user. For each server tested, I provide an assessment of the expected user skill set. General definitions of basic bioinformatics skills and expert bioinformatics skills are also provided (Table 3).

Table 3:
Basic and expert bioinformatics skill levels

ALS/FTLD study: novel SNPs discovered in sequencing

Novel SNPs are often discovered through DNA sequencing studies that compare individuals with a condition of interest to a control population. In a recent study of familial amyotrophic lateral sclerosis (ALS) with frontotemporal lobar degeneration (FTLD), researchers investigated sequence variation in the gene TARDBP [45]. All coding exons, most of the 5′-untranslated region, and approximately 100 intronic bases upstream and downstream of each exon were sequenced for 259 ALS/FTLD patients and 1127 controls. The TARDBP (NM_007375) variants 869G->C (amino acid change G290A) and 892G->A (amino acid change G298S) were found to be statistically associated with disease, and putatively linked with loss and/or gain of protein function.

All of the ‘Methods Servers’ are capable of handling novel non-synonymous SNPs, because they offer the ability to submit a protein sequence along with a residue position and amino acid substitution. None of the ‘Hybrid Servers’ or ‘Meta-servers’ allows submission of protein sequences, but one of the ‘Meta-servers’ (FAST-SNP [15]) handles novel SNPs of all kinds, by allowing the user to submit a DNA sequence plus base position and nucleotide substitution. The TARDBP SNPs were submitted to the SIFT [4], PolyPhen [10], SNAP [17], PMUT, PANTHER [18], nsSNPAnalyzer [46], PhD-SNP [47], Auto-mute [48] and FAST-SNP servers. In cases where servers offered a choice of parameter settings, defaults were used. Generally, the servers reported results that were understandable, if accepted at face value. Most predicted that both SNPs are neutral, and the predictions that disagreed with neutrality were low confidence (Table 4). The servers varied widely in terms of communicating prediction reliability. Some have no confidence measures and some have a simple binary (yes/no) confidence measure. The SNAP server provides the most detailed confidence information, including both a reliability index and an estimated accuracy rate for each prediction. In general, the results are qualitative, rather than quantitative, reflecting the current state-of-the-art of webserver-based SNP function prediction.

Table 4:
Field test of novel SNPs discovered in sequencing

Required user skills

  1. Basic bioinformatics skills (Table 3) such as ability to find and handle data, accession numbers and reference identifiers in web databases such as UniProt, PDB and NCBI.
  2. Bioinformatics expertise (Table 3) is required to understand server errors. Two servers returned error messages that assumed users know about hidden Markov models and protein structural homology.
  3. Bioinformatics expertise is required to think critically about how to interpret server results and their significance.

Interpreting server results

SIFT

In addition to predicting SNP functional impact, SIFT builds a protein multiple sequence alignment of the protein of interest and emails it to the user, allowing alignment analysis with bioinformatics software. I used the SIFT TARDBP alignment to build a phylogenetic tree, using neighbor-joining by sequence identity in JALVIEW [49]. Human TARDBP is located in a distinct clade on this tree. The G290A and G298S SNPs are in a glycine-rich domain that is present only in this clade and appears to be an evolutionary latecomer in the TARDBP protein family. Sequence annotations, available through JALVIEW links to European Bioinformatics Institute (EBI) resources, indicate that human proteins in the clade are expressed in brain tissues, rendering plausible the hypothesis that at least one of these SNPs, or a SNP in this protein domain that is in linkage disequilibrium with these SNPs, could contribute to ALS/FTLD, a brain disorder.

FAST-SNP

According to FAST-SNP, the sequence surrounding the SNPs is a significant match to a predicted TFBS, but TFBS predictions are generally not reliable unless the prediction is in a known promoter region. Given that this region is within a coding exon, one should be suspicious of this prediction. FAST-SNP submitted the sequence to three splicing regulatory analysis servers: ESEFinder [50], Rescue-ESE [51] and FAS-ESS [52]. Only one of the three predicted anything. That prediction is that 869G->C (G290A) introduces a significant match (CTAATAG) to the canonical splicing enhancer motif CAGAGGG, which is bound by SF2/ASF proteins. Altogether, this raises the interesting possibility of impact on the regulatory level rather than the protein level.

A user of these SNP methods servers who reads their outputs only at a surface level would conclude that the two ALS/FTLD SNPs are neutral. However, a user with bioinformatics expertise (Table 3) might use the server results to suggest testable hypotheses about how these SNPs could affect biological function.

Schizophrenia study: common intronic SNPs

When case–control studies are done with microarray or TaqMan technologies that use SNP probe libraries, researchers may find SNPs in which the frequency differences between cases and controls are statistically significant. These SNPs are not novel, and are already indexed in large databases such as dbSNP [53]. A recent study compared two large schizophrenia populations to ethnically matched controls [54]. Seven SNPs in the introns of PDE4B, which encodes a large phosphodiesterase involved in cAMP signaling regulation, were found to be significantly associated with schizophrenia (dbSNP reference identifiers: rs4320761, rs910694, rs1354064, rs1321177, rs2144719, rs1040716 and rs78038).

The ‘schizophrenia SNPs’ were submitted to five servers that handle intronic SNPs: SNPselector [55], PupaSuite, FASTSNP, F-SNP [56] and MutaGeneSys [57] (Table 5). None of the SNPs were predicted to have functional impact by SNPselector, FASTSNP and MutaGeneSys. PupaSuite reported that rs910694 is in a DNA triplex region, a region of DNA with three strands. These regions play a role in repression of transcription, reviewed in [58], thus this SNP could putatively disrupt normal regulation of PDE4B. F-SNP identified three of the SNPs (rs1354064, rs4320761 and rs1040716) as being involved in transcriptional regulation and rs1040716 as being at a position that is conserved among species.

Table 5:
Field test of common intronic SNPs

Required user skills

  1. Submitting queries to these servers requires no special skills, just dbSNP reference identifiers for a SNP of interest.
  2. Bioinformatics skills are required to understand outputs of SNPselector, PupaSuite and F-SNP, even at a surface level.
  3. Unix skills are required to access SNPselector's results, which are sent by email as a compressed tarball.
  4. F-SNP requires expertise specifically with UCSC Genome Browser tools and terms.
  5. The FASTSNP server outputs are integrated into a decision tree algorithm, which is clearly laid out and understandable to a general user. This feature is not available in FASTSNP's ‘novel SNP’ service.
  6. MutaGeneSys requires some knowledge of statistical genetics, as the user must select a minimum coefficient of determination and has the option of selecting a HapMap population. It reports when a SNP is correlated by linkage disequilibrium with an externally annotated disease-associated SNP, based on OMIM [59]. Both single-marker and two-marker correlations are considered.

MutaGeneSys is a tool aimed at the medical genetics community, where the importance of linkage disequilibrium is well understood. By enabling identification of SNPs that are indirectly associated with disease, it can help users narrow down the number of SNPs likely to have a direct functional effect. The PupaSuite result for rs910694 suggests a testable hypothesis that might explain schizophrenia association.
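As background for the correlation threshold mentioned above, the sketch below computes the standard pairwise r² measure of linkage disequilibrium from allele and haplotype frequencies, then keeps only the SNPs that exceed a user-chosen minimum. It is an illustration under stated assumptions: the function names, SNP labels and frequencies are all hypothetical, and it is not MutaGeneSys's implementation (which also considers two-marker correlations).

```python
def ld_r_squared(p_ab, p_a, p_b):
    """Pairwise linkage disequilibrium r^2 between two biallelic SNPs.

    p_a, p_b : frequencies of allele A at SNP 1 and allele B at SNP 2
    p_ab     : frequency of the haplotype carrying both A and B
    """
    d = p_ab - p_a * p_b                       # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


def correlated_snps(p_a, candidates, min_r2):
    """Names of candidate SNPs whose r^2 with the query SNP is >= min_r2.

    candidates maps a SNP name to (p_b, p_ab) for that SNP and the query.
    """
    return [name for name, (p_b, p_ab) in candidates.items()
            if ld_r_squared(p_ab, p_a, p_b) >= min_r2]


# Made-up frequencies for three hypothetical neighbouring SNPs.
candidates = {
    "snp_1": (0.5, 0.50),   # perfectly correlated with the query SNP
    "snp_2": (0.5, 0.25),   # statistically independent of the query SNP
    "snp_3": (0.4, 0.35),   # partially correlated
}
tagged = correlated_snps(p_a=0.5, candidates=candidates, min_r2=0.8)
print(tagged)
```

Raising the minimum r² shortens the list of indirectly associated SNPs; lowering it casts a wider net at the cost of weaker correlations.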

Esophageal cancer study: mix of common exonic and intronic SNPs

Esophageal and esophago–gastric junction adenocarcinomas (EAC and EGJAC) have been linked to acid reflux, obesity and smoking. Risk is also related to exposure to nitrites (found in compounds such as tobacco smoke) that alkylate DNA at the O6 position of guanine [60]. A recent population case–control study in Australia looked at SNPs in DNA-repair genes MGMT, XPD, XRCC1 and ERCC1 to identify possible genetic predispositions to EAC and EGJAC. MGMT (O6-methylguanine-DNA methyltransferase) specifically repairs O6-guanine alkylation damage. Results point to MGMT SNPs rs12268840 (intronic) and rs2308321 (non-synonymous) as differing significantly in frequency between EAC patients (n = 263) and controls (n = 1337) [60].

The MGMT SNPs were submitted to all 22 servers (those that do not handle intronic SNPs were only queried about rs2308321) (Table 6). None of the servers predicted that rs2308321 has an impact on protein function. Several servers reported that this SNP was found at splicing regulation sites, but only F-SNP predicted that it would impact splicing regulation, because it changes both an exonic splicing enhancer and an exonic splicing repressor. None of the servers predicted functional impact for rs12268840.

Table 6:
Field test of non-synonymous and intronic SNPs

Required user skills

  1. The basic skills required to input queries and interpret outputs are the same as described for the TARDBP and ‘schizophrenia SNPs’.
  2. Bioinformatics skills and knowledge of human genome structure allow users to submit advanced input queries. Genomic range is accepted by MutDB [61], Snap [62], PupaSuite, SNP Function Portal [14], F-SNP and LS-SNP [11]. Linkage disequilibrium can be factored into inputs using PupaSuite, SNP Function Portal, SNPselector and MutaGeneSys. In total, 18 distinct input data types are available on the servers tested (Table 5).
  3. Results output of the meta-servers (Table 2B) is generally large, heterogeneous and difficult to integrate without bioinformatics skills. One exception is the FastSNP server, which integrates its harvested data in a decision tree algorithm that is transparent and clearly explained to users.

The only testable hypothesis yielded from these server results was the possibility that splicing regulation of MGMT might be affected by rs2308321. In general, there is poor agreement among servers that harvest predictions of SNP impact on splicing, and the predictions are not associated with clear reliability measures.

Stale data

Most of the tested servers use NCBI's dbSNP [53] database as a primary source of SNP data, but are not up-to-date, increasing the chances that annotations for SNPs of interest will not be available to users. Between 2003 and 2008, dbSNP was updated, on average, 2–3 times per year. Fourteen of the tested servers accept dbSNP rsIDs, and the current dbSNP build is version 129, May 2008. Only one server, FastSNP, is using version 128. Seven servers are using version 126; three are using version 125; two are using version 124 and one is still using version 123 (from October 2004) (Supplementary Table 1).
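The build counts quoted above can be tallied in a few lines. This is only an arithmetic check on the figures in the text; the variable names are illustrative.

```python
CURRENT_BUILD = 129  # dbSNP build current at the time of the field tests

# Number of tested servers using each dbSNP build, as reported in the text.
servers_per_build = {128: 1, 126: 7, 125: 3, 124: 2, 123: 1}

total_servers = sum(servers_per_build.values())

# Servers that are three or more builds behind the current one.
far_behind = sum(n for build, n in servers_per_build.items()
                 if CURRENT_BUILD - build >= 3)

# Average number of builds by which a server lags.
mean_lag = sum((CURRENT_BUILD - build) * n
               for build, n in servers_per_build.items()) / total_servers

print(total_servers, far_behind, round(mean_lag, 2))
```

The per-build counts sum to the fourteen servers that accept rsIDs, and all but one of them are at least three builds behind.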

These three field studies suggest a set of desirable features for a SNP webserver:

  1. Options for submission input that require minimal bioinformatics expertise. Even when advanced submission options are available, offering an easy way to input SNPs ensures that a wider community will have access to the server.
  2. Error messages that do not require bioinformatics expertise to understand. Messages that assume such expertise can be confusing and frustrating to users and alienate non-bioinformaticians.
  3. For those with bioinformatics expertise (Table 3), the option to download server outputs such as alignments and protein structure models. The ability to use external bioinformatics software to analyze server output will help bioinformaticians develop testable hypotheses about SNP biological impact.
  4. Quantitative, calibrated measures of prediction reliability. If server output contains many predictions, such as impact on protein structure, impact on exonic splicing, etc., a reliability measure should be provided for each prediction. Without such information, users will have difficulty assessing which prediction is the most likely to be correct.
  5. A method to integrate diverse outputs of heterogeneous data types and to put them in perspective. Without algorithms to integrate and prioritize information available about a SNP, many users will come away with nothing of value.
  6. Ability to handle all kinds of SNPs, possibly through linkouts to other servers. Users will often not know in advance whether SNPs of interest impact protein function or regulation. It is annoying to submit SNPs and find that there is no information about them because you have chosen an inappropriate server.
  7. Ability to report other SNPs in linkage disequilibrium with submitted SNPs. If submitted SNPs are indirectly linked to disease, users will benefit by discovering which other SNPs might be responsible, so that their biological impact can be investigated.
  8. Up-to-date data. The dbSNP database is updated several times a year. The number of new human SNP reference IDs ranges widely (e.g. 44 000 in Build 127, over 6 000 000 in the current Build 129). When SNP webservers do not keep up with these updates, users miss out on coverage of thousands to millions of SNPs.

HOW DIFFERENT ARE THE VARIOUS SNP ANNOTATION METHODS?

A review of current literature reveals that medical geneticists are grappling with issues surrounding the meaning of agreement and disagreement among available SNP annotation methods.

  1. In a meta-analysis study that included computational biology nsSNP methods, predictive scores (for 54 nsSNPs in 37 genes) were compared to lung cancer risk odds ratios from 51 published case–control studies, using a non-parametric correlation test (Spearman rank) [19]. The authors designed a summary statistic which combined scores from SIFT, PolyPhen, SNPs3D [16] and PMut and reported that the summary was more highly correlated with the lung cancer risk odds ratios (r = 0.51) than any of the individual scores. The correlation increase was modest with respect to SIFT, the most highly correlated individual score (r = −0.36). The rationale for combining scores produced by different methods was that each method uses a ‘fundamentally different algorithm’, and that when the algorithms agree, predictions are more trustworthy.
  2. In a case–control study of nsSNPs in nucleotide excision repair genes, putatively linked with prostate cancer [63], SIFT and PolyPhen were used to explore the possible biological impact of seven nsSNPs with significant association to prostate cancer and minor allele frequency > 0.05. The methods disagreed on four nsSNPs and for two out of three on which they agreed, a functional nucleotide excision repair capacity (NERC) assay disagreed with both. The authors tried to explain these disparities by suggesting that PolyPhen uses protein structure information, while SIFT uses evolutionary sequence conservation, but this is not generally true, as described below.
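For readers unfamiliar with the Spearman rank test used in the first study, the following sketch shows how the statistic is obtained: both variables are converted to ranks (ties share their average rank) and an ordinary Pearson correlation is computed on the ranks. The scores and odds ratios below are invented for illustration and are not data from [19].

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks of values; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: a tolerance-style score where lower means more damaging,
# paired with odds ratios that fall as the score rises.
scores = [0.01, 0.20, 0.45, 0.80]
odds_ratios = [1.9, 1.4, 1.1, 0.9]
rho = spearman(scores, odds_ratios)
print(rho)
```

Because the statistic depends only on rank order, a score for which low values indicate a damaging substitution yields a negative coefficient when risk rises with predicted damage, so the sign of a reported correlation reflects the direction of the score's scale.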

Although users may perceive SNP prediction services as a set of fundamentally different methods, there are major similarities ‘underneath the lid’. For example, SIFT, PolyPhen's PSIC (Position Specific Independent Counts) score and ‘SNPs3D Profile SVM’ (support vector machine) all base their predictions on a multiple sequence alignment of the protein of interest and related proteins. Although PolyPhen does use protein structural information when it is available, for the majority of queries, its predictions are based on amino acid residue properties and PSIC sequence alignment scores [64]. Like SIFT, the PSIC score measures the probability that a substituted amino acid will be tolerated, based on the distribution of amino acids in a multiple sequence alignment column. The measures differ mainly in technical details, such as how pseudocounts and sequence weighting are applied. When SIFT and PolyPhen outputs are substantially different, it is probably because different multiple sequence alignments were used to calculate scores, rather than these details. Inferences based on amino acid column distributions are also used in PANTHER, and as input features to machine learners LS-SNP, SNPs3D, SNAP and PMut. While the decision algorithms used by these different methods are not the same, the correlation among their outputs is the result of similarity among their inputs, and is not necessarily ground for increased confidence.
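The shared starting point of these methods, an amino acid distribution estimated from one alignment column, can be sketched as follows. This is a deliberately simplified illustration with a uniform pseudocount and no sequence weighting. It is neither the SIFT nor the PSIC algorithm, and the column and function name are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def substitution_probabilities(column, pseudocount=1.0):
    """Estimate P(amino acid) in one alignment column.

    column      : string with one residue per aligned sequence; gap and
                  unknown characters are ignored
    pseudocount : uniform count added to every amino acid so that residues
                  unseen in the column keep a small non-zero probability
    """
    counts = Counter(aa for aa in column.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


# Hypothetical column: glycine in nine sequences, alanine in one.
column = "GGGGGGGGGA"
probs = substitution_probabilities(column)
print(round(probs["G"], 3), round(probs["A"], 3), round(probs["W"], 3))
```

Any two methods that start from the same column will rank substitutions similarly whatever decision algorithm sits on top, while changing the set of aligned sequences changes every downstream score. This is why differences between alignments matter more than details such as the pseudocount scheme.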

The authors of the lung cancer meta-analysis assumed that the two ‘SNPs3D SVMs’ (‘SVM Profile’ and ‘SVM structure’) could be grouped together because they are more similar to each other than either one is to SIFT. Emphasis on the SVM algorithm caused them to miss the fundamental similarity between ‘SVM Profile’ and SIFT. A better choice for the summary statistic would be ‘SVM structure’, because it is based on protein structure, and provides an orthogonal prediction to methods based on sequence alignment.

As scientists outside of the bioinformatics community attempt to optimize their use of SNP prediction methods, those within the community must make an effort to better communicate the inner workings of these methods and to clarify both their similarities and differences.

SNP WEBSERVERS: CHALLENGES FOR THE NEXT GENERATION

SNP webservers of the first generation were created by bioinformaticians for bioinformaticians. A major challenge for next generation tools is how to deliver utility to medical geneticists and molecular biologists.

Flexible input tools that can handle high throughput data

Users should have the option of entering from one to thousands of SNPs, including novel SNPs. FASTSNP already allows entry of genes of interest and returns a list of all known SNPs, which can then be selected for annotation. But as the number of candidate SNPs of interest increases, manual selections will not be feasible. Users should be able to enter SNP lists in the form that they receive them from sequencing centers (DNA base change, chromosome position and transcript identifier) or to directly submit the output files from Illumina bead arrays or Affymetrix genotyping arrays.
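As a small illustration of batch input, the sketch below parses a list of variants written in the transcript-plus-base-change notation used earlier in this review (for example, NM_007375 869G->C) and reports unparseable lines in plain language. The one-variant-per-line layout and all names in the code are assumptions made for the example, not a format accepted by any of the servers discussed.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    transcript: str  # e.g. a RefSeq transcript identifier
    position: int    # base position within the transcript
    ref: str         # reference base
    alt: str         # substituted base


# Assumed line layout: "<transcript> <position><ref>-><alt>"
_PATTERN = re.compile(
    r"^(?P<transcript>\S+)\s+(?P<pos>\d+)(?P<ref>[ACGT])->(?P<alt>[ACGT])$"
)


def parse_variants(text):
    """Split a block of text into parsed variants and rejected lines."""
    variants, rejected = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = _PATTERN.match(line)
        if match:
            variants.append(Variant(match["transcript"], int(match["pos"]),
                                    match["ref"], match["alt"]))
        else:
            rejected.append(line)
    return variants, rejected


batch = """
NM_007375 869G->C
NM_007375 892G->A
not a variant
"""
variants, rejected = parse_variants(batch)
for line in rejected:
    print(f"Could not read this line as a variant: {line!r}")
```

Collecting rejected lines instead of stopping at the first failure lets a user with thousands of SNPs correct all the problem entries in one pass.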

Leverage of genome correlation structure

Users should be able to find out if their SNPs are in linkage disequilibrium with other SNPs having known or predicted functional effects. MutaGeneSys already allows users to select a preferred correlation threshold and find SNPs listed in OMIM that are in linkage disequilibrium with input SNPs. Such capabilities can be expanded to SNPs having predicted functional impact on regulation or protein function.

Other types of genetic variation

The causative mutation sought in association studies may turn out to be a copy number variant, an inversion, deletion, insertion or frameshift. As other kinds of genetic variation are catalogued, it will be useful both to annotate them and to provide information about linkage disequilibrium between SNPs and these variants.

Cis-regulation

Associations between phenotype and intronic, UTR, and/or promoter region SNPs are prominent in case/control and family studies published over the last several years [65-76]. Yet computational methods to predict the effects of these SNPs lag behind those developed for their impact on proteins. We do not know yet how to accurately detect the sequence signals that identify sites important for transcription, splicing, or miRNA binding, or how to score the impact of a SNP on these sites. Advances in basic science and computational analysis of these elements will play an important role in advancing the utility of SNP webservers.

Inference framework

In addition to serving hypotheses about molecular mechanisms, servers should offer the option of integrating multiple hypotheses and molecular features into a decision algorithm. Without such a framework, and given the growing number of known regulatory mechanisms, users have difficulty making sense of available information, particularly when harvested by meta-servers. FASTSNP already offers a decision tree framework to integrate information into a risk level (1–5).

Dynamic visualization and analysis tools

The outputs of many servers include protein sequence alignments, structural models, model viewers and structural features. But protein and SNP representations using ribbons, balls and sticks, and multiple sequence alignments cannot provide biological insight to anyone but a protein expert, even if the graphics are interactive. We can maximize the utility of these tools by designing them to help users gain intuition about SNP effects, such as the impact of amino acid substitution. Interactive protein structure graphics could be pre-annotated by ‘painting’ according to biologically important attributes, such as electrostatic surface potential, and the tools could allow users to see how these attributes change with amino acid substitution. Interactive multiple sequence alignment graphics could dynamically display relevant statistics, such as probability that a given amino acid substitution is tolerated in an alignment column. New tools could allow users to experiment with selecting different amino acids and to view how the tolerance probability changes.

Dynamic data updates

Most current SNP webservers are based in academic labs and are not supported by full-time staff. Furthermore, these servers were designed to store data locally, requiring regular downloads from their primary sources (such as NCBI, UCSC Genome Browser, UniProt, etc.) and subsequent rerunning of annotation pipelines. It is not surprising that most servers are 2 years or more out-of-date. FAST-SNP and SNPit (http://students.washington.edu/hyshen/research.html which is not yet publicly available) have already made progress on this problem. FAST-SNP uses reconfigurable web wrapper agents to fetch HTML pages, extract relevant data, deliver to a Web Navigation Description Language (WNDL) executor kernel and then to its machine learning algorithm, which renders a decision about SNP risk level. SNPit's wrappers are HTTP servlets that accept queries as URLs and return XML formatted data. It uses a BioMediator ‘source knowledge base’, composed of a central data model and rules to translate the source data models into the common data model. These distributed data integration technologies help ensure that data delivered to the user is up-to-date, although there is nothing they can do about stale data at their sources.
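The wrapper pattern described above, in which a query is expressed as a URL and the XML reply is translated into a common data model, can be sketched generically as follows. Everything specific here is an assumption made for illustration: the endpoint URL, the query parameter, the XML element names and the field mapping are hypothetical and do not reflect the actual interfaces of SNPit, BioMediator or FAST-SNP. No network request is made; a canned response stands in for the remote source.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_query_url(base_url, rs_id):
    """Express a SNP query as a URL (the parameter name is hypothetical)."""
    return f"{base_url}?{urlencode({'rsid': rs_id})}"


def to_common_model(xml_text, field_map):
    """Translate one source's XML record into a common data model.

    field_map maps the source's element names to the common field names,
    playing the role of the per-source translation rules.
    """
    root = ET.fromstring(xml_text)
    record = {}
    for source_tag, common_name in field_map.items():
        node = root.find(source_tag)
        if node is not None and node.text:
            record[common_name] = node.text.strip()
    return record


# Canned reply standing in for a remote source (hypothetical schema).
SAMPLE_RESPONSE = (
    "<snp><id>rs910694</id><gene>PDE4B</gene>"
    "<location>intron</location></snp>"
)
FIELD_MAP = {"id": "rs_id", "gene": "gene_symbol", "location": "region"}

url = build_query_url("http://example.org/snp-wrapper", "rs910694")
record = to_common_model(SAMPLE_RESPONSE, FIELD_MAP)
print(url)
print(record)
```

Keeping the translation rules in a per-source mapping means that a change in one source's format requires editing only that mapping, not the code that consumes the common model.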

SUMMARY AND CONCLUSIONS

As a whole, the 22 SNP annotation webservers assessed in this study yielded interesting hypotheses to explain why several SNPs might be statistically associated with ALS, schizophrenia or esophageal cancer in recent medical genetics studies. However, these hypotheses were not immediately apparent and required bioinformatics expertise to sift out from a wide array of ‘black box’ classifications, technical details and predictive scores spanning evolutionary conservation, protein structure, splicing regulators, transcriptional regulators, etc.

The next generation of SNP annotation webservers can take advantage of the growing amount of data in core bioinformatics resources and use intelligent agents to fetch data from different sources as needed. From a user's point of view, it is more efficient to submit a set of SNPs and receive results in a single step, which makes meta-servers the most attractive choice. However, if meta-servers deliver heterogeneous data covering sequence, structure, regulation, pathways, etc., they must also provide frameworks for integrating these data into one or more decision algorithms, together with quantitative confidence measures so that users can assess which data are relevant and which are not. Without progress along these lines, all of these data will only be useful to bioinformatics experts.
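
One very simple way to picture such an integration framework is a weighted combination of per-source scores, reported together with a measure of how much the sources agree. The sketch below is an assumption-laden toy, not a method proposed or evaluated in this review: the `Evidence` fields, the reliability weights, the score scale (0 for likely neutral, 1 for likely functional) and the agreement formula are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Evidence:
    source: str    # e.g. "conservation", "structure", "splicing"
    score: float   # assumed scale: 0 = likely neutral, 1 = likely functional
    weight: float  # hypothetical reliability weight for this source


def combine(evidence: List[Evidence]) -> Tuple[float, float]:
    """Return (combined score, agreement), both in the range 0..1.

    The combined score is a weighted mean. Agreement is 1 minus twice the
    weighted standard deviation, so it is 1.0 when all sources give the
    same score and falls towards 0.0 as they diverge.
    """
    total_weight = sum(e.weight for e in evidence)
    if total_weight == 0:
        raise ValueError("no usable evidence")
    combined = sum(e.score * e.weight for e in evidence) / total_weight
    variance = sum(e.weight * (e.score - combined) ** 2 for e in evidence) / total_weight
    agreement = max(0.0, 1.0 - 2.0 * variance ** 0.5)
    return combined, agreement


# Hypothetical evidence for one SNP from three kinds of predictors.
evidence = [
    Evidence("conservation", 0.9, 1.0),
    Evidence("structure", 0.8, 1.0),
    Evidence("splicing", 0.1, 1.0),
]
combined, agreement = combine(evidence)
print(f"combined={combined:.2f} agreement={agreement:.2f}")
```

A low agreement value signals that the combined score hides conflicting predictions, which is exactly the situation in which a non-expert user most needs guidance. Realistic weights would have to be learned from benchmark data rather than set by hand.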

Key Points

  • Computational biology methods for SNP annotation can contribute most to medical genetics research if they are delivered through services that researchers who are not bioinformatics experts can easily use and understand.
  • Medical geneticists and molecular biologists who are interested in using available SNP annotation web servers can select from: (i) servers that disseminate original methods to predict biologically important SNPs; (ii) metaservers, which yield large amounts of heterogeneous bioinformatics information from external servers; and (iii) hybrids which combine (i) and (ii).
  • There is more similarity among bioinformatics SNP annotation methods than many users realize.
  • Developing new algorithms for integrating heterogeneous datatypes is now essential to take advantage of the available information, which can potentially be used to infer the biological impact of SNPs.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

The author thanks Dr Melissa Cline for valuable discussions.

Biography

Rachel Karchin is an assistant professor in the Department of Biomedical Engineering and the Institute for Computational Medicine at Johns Hopkins University. Her research focuses on predicting the functional impact of SNPs and tumorigenic somatic mutations on biological systems. She is the originator of LS-SNP, a SNP annotation webserver that suffered from many of the shortcomings described in this review, and her group has just released a substantially upgraded version.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press