NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
Bioinformatics. Author manuscript; available in PMC Feb 7, 2006.
Published in final edited form as:
PMCID: PMC1361283
NIHMSID: NIHMS7086

SNPselector: a web tool for selecting SNPs for genetic association studies

Abstract

Summary: Single nucleotide polymorphisms (SNPs) are commonly used in association studies to find genes responsible for complex genetic diseases. With recent advances in SNP technology, researchers can assay thousands of SNPs in a single experiment, but manually choosing thousands of genotyping SNPs for tens or hundreds of genes is time-consuming. We have developed a web-based program, SNPselector, to automate the process. SNPselector takes a list of gene names or a list of genomic regions as input and searches the Ensembl genes or genomic regions for available SNPs. It prioritizes these SNPs by their tagging for linkage disequilibrium (LD), SNP allele frequencies and source, function, regulatory potential, and repeat status. SNPselector outputs the results as compressed Excel spreadsheet files for review by the user.

Availability: SNPselector is freely available at http://primer.duhs.duke.edu/

Contact: hong.xu/at/duke.edu, mike.hauser/at/duke.edu

INTRODUCTION

Single nucleotide polymorphisms (SNPs) are the most common form of polymorphism in the human genome. A variety of genotyping platforms are available for high-throughput assay of SNPs. SNPs are widely used for human evolution research (Hammer et al., 2001; Underhill et al., 2000), association studies of complex diseases (Colomb et al., 2001; Martin et al., 2001), and studies of pharmacogenetics (Goldstein et al., 2003).

The amount of SNP data in public databases is increasing dramatically. The number of unique human SNPs in the current dbSNP release (build 123) is more than 10 million, approaching the theoretically expected number of SNPs in the human genome (Kruglyak & Nickerson, 2001). SNP detection methods vary, as does the reliability of the SNPs in dbSNP. Only 50% of the SNPs are validated, and fewer than 20% of the validated SNPs have allele frequency information. In addition to validation and allele frequency information, it is also important to select SNPs based on their genomic location and their proposed functional significance (coding, intronic, promoter, etc.). These annotation data are available in various resources, such as NCBI dbSNP (http://www.ncbi.nlm.nih.gov/SNP), UCSC Genome Browser (Karolchik et al., 2003), and Ensembl (Birney et al., 2004). These resources also provide data mining features to retrieve SNP information. Several other bioinformatics tools have been developed to select SNPs based on various properties. PromoLign (Zhao et al., 2004) and PupaSNP Finder (Conde et al., 2004) are two web tools for finding SNPs that may affect gene transcription levels. SNPper (Riva & Kohane, 2002) provides a web interface to retrieve SNP annotation by chromosome region or SNP name and to refine SNP selection with different filters (such as validation status or minor allele frequency).

Here we describe a new SNP selection program that combines many of these attributes. It has an easy-to-use web interface that provides a feature-rich result spreadsheet. It incorporates LD calculations into SNP selection to help reduce the number of SNPs required for a comprehensive analysis. Further, the output can be tailored to provide the required fields for commercial genotyping systems, such as the Illumina bead-based genotyping platform.

IMPLEMENTATION

SNPselector is implemented in object-oriented Perl to search and analyze SNP data. Its core module can be run as a UNIX command-line application. To make it easy to use, a CGI wrapper was developed to provide the web interface between users and the application.

To increase the performance of the application, all SNP data and related genome annotation data are stored in a local MySQL database (http://www.mysql.com/). SNP data, including SNP location, alleles, function, and validation information, were downloaded from the UCSC Genome Browser server. Two 100-bp flanking sequences for each SNP were then extracted from the human genome (NCBI build 35) and added to the SNP table. SNP allele frequency and genotyping data were downloaded from the HapMap project (http://www.hapmap.org/), the SNP Consortium (http://snp.cshl.org/), JSNP (http://snp.ims.u-tokyo.ac.jp/), Affymetrix (http://www.affymetrix.com/), and Perlegen (Hinds et al., 2005). SNPs with experimentally verified genotyping or allele frequency information are considered “high quality” SNPs. Ensembl gene structure information was obtained from the Ensembl project. Conserved region information was downloaded from the UCSC Genome Browser multi-genome alignments (Blanchette et al., 2004). CpG island, transcription factor binding site (TFBS), microRNA, and simple repeat data were also downloaded from UCSC and stored in the local MySQL database.

The local database is updated whenever new public data are released.

PROGRAM WORKFLOW

SNPselector takes a list of gene names or genomic regions as input and finds all available SNPs in those genes or regions. It finds tagging SNPs by calculating LD bins of genotyped SNPs (Carlson et al., 2004). It then assigns SNP function based on whether the SNP may affect the gene transcript structure or the protein product. It assesses the regulatory potential of the SNP based on its location in a conserved site from multi-genome comparison, a conserved transcription factor binding site, a CpG island, or a microRNA gene. It also checks whether the SNP is in a repeat region. It scores and sorts SNPs by their LD tagging property, quality, function, regulatory potential, and repeat status. Finally, it exports the SNP selection results into Excel files (Figure 1).

Fig. 1.
Workflow of SNP selection process.

SNP search

SNPselector provides the user with four types of SNP search:

  1. by dbSNP accession ID (rs number).
  2. by gene names: Ensembl genes and their chromosomal locations are obtained. For each Ensembl gene, SNPselector searches all SNPs in the corresponding chromosomal region (plus the flanking sequence regions defined in the user input).
  3. by genomic regions (gene centric): SNPselector finds all Ensembl genes and their chromosomal locations in that genomic region. For each Ensembl gene in the region, SNPselector searches all SNPs within the gene.
  4. by genomic regions: The program breaks each region into smaller 2-Mb regions if necessary. It then searches all SNPs in each 2-Mb region.
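The region-splitting step in search type 4 can be sketched as follows. This is a minimal illustration, not the authors' Perl implementation: the function name `split_region` and the 1-based inclusive coordinate convention are assumptions; only the 2-Mb chunk size comes from the text.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 2_000_000  # 2 Mb, the chunk size stated in the text


def split_region(chrom: str, start: int, end: int,
                 chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) sub-regions no longer than chunk_size.

    Coordinates are treated as 1-based and inclusive (an assumption;
    the paper does not state its coordinate convention).
    """
    if start > end:
        raise ValueError("start must not exceed end")
    pos = start
    while pos <= end:
        chunk_end = min(pos + chunk_size - 1, end)
        yield (chrom, pos, chunk_end)
        pos = chunk_end + 1


# A hypothetical 5-Mb query is split into two full chunks and one remainder.
chunks = list(split_region("chr7", 1, 5_000_000))
```

Each sub-region can then be queried independently, which keeps any single database query bounded in size.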

SNP retrieval

After searching by one of these methods, SNPselector retrieves SNP information from the database by SNP rs number. The information includes SNP allele, allele frequency, chromosomal location, validation status, quality, predicted function, and flanking sequence. When obtaining the SNP flanking sequence, SNPselector annotates any neighboring SNPs within 100 bases of the target SNP using IUPAC codes. This helps ensure that no assay is designed over a neighboring SNP, which can cause many SNP genotyping assays to fail. SNPselector also compares the SNP location with the Ensembl transcript structure to determine whether the SNP is intronic, exonic, or intergenic and, for non-intergenic SNPs, records which exon or intron contains the SNP.
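The flanking-sequence annotation can be illustrated with a short sketch. The IUPAC ambiguity codes are standard; everything else (the function names, the input layout, and the restriction to single-base alleles) is an assumption made for illustration and not a description of SNPselector's internals.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def iupac_code(alleles):
    """Return the IUPAC code for a set of single-base alleles, e.g. 'CT' -> 'Y'."""
    bases = frozenset(a.upper() for a in alleles)
    if len(bases) == 1:
        return next(iter(bases))
    return IUPAC[bases]  # raises KeyError for indels or other non-base alleles


def annotate_flank(sequence, seq_start, target_pos, neighbors, flank=100):
    """Return (left, right) flanks of the target SNP with neighbors masked.

    sequence   -- reference bases covering the target and its flanks
    seq_start  -- genomic coordinate of sequence[0]
    neighbors  -- hypothetical mapping {genomic position: alleles}
    """
    bases = list(sequence)
    for pos, alleles in neighbors.items():
        if pos == target_pos or abs(pos - target_pos) > flank:
            continue
        idx = pos - seq_start
        if 0 <= idx < len(bases):
            bases[idx] = iupac_code(alleles)
    t = target_pos - seq_start
    left = "".join(bases[max(0, t - flank):t])
    right = "".join(bases[t + 1:t + 1 + flank])
    return left, right


# Toy example with a 3-base flank so the result is easy to read.
left, right = annotate_flank("ACGTACGTAC", 100, 105, {103: "TC", 107: "TG"}, flank=3)
```

In the toy example, the neighboring C/T SNP appears as `Y` in the left flank and the G/T SNP as `K` in the right flank, so an assay designer can see at a glance which positions to avoid.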

For SNPs that are queried by gene or genomic region, SNPselector calculates LD bins using the HapMap genotyping data and the “ldSelect” program from the University of Washington (Carlson et al., 2004). This helps to select the most informative SNPs and to avoid genotyping redundant SNPs. Perlegen genotyping data for African American, Caucasian, and Chinese populations were also added to the SNPselector database. Users can select one of the genotyping data sources or populations for LD bin analysis.
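To make the idea of LD bins and tagging SNPs concrete, the following is a simplified greedy binning sketch in the spirit of the approach described by Carlson et al. (2004). It is not the ldSelect program itself: it assumes pairwise r² values have already been computed, the threshold of 0.8 is an assumed default, and the rs identifiers are made up.

```python
def greedy_ld_bins(snps, r2, threshold=0.8):
    """Group SNPs into LD bins and report the tagging SNPs of each bin.

    snps      -- list of SNP identifiers
    r2        -- dict mapping frozenset({a, b}) to a precomputed r^2 value
    threshold -- r^2 cut-off for calling two SNPs linked (assumed value)
    """
    def linked(a, b):
        return r2.get(frozenset((a, b)), 0.0) >= threshold

    remaining = list(snps)
    bins = []
    while remaining:
        # Seed the bin with the SNP linked to the most remaining SNPs
        # (ties are broken by input order).
        seed = max(remaining,
                   key=lambda s: sum(linked(s, o) for o in remaining if o != s))
        members = [s for s in remaining if s == seed or linked(seed, s)]
        # A tagging SNP is linked to every other member of its bin.
        tags = [s for s in members
                if all(s == o or linked(s, o) for o in members)]
        bins.append({"members": members, "tags": tags})
        remaining = [s for s in remaining if s not in members]
    return bins


# Hypothetical r^2 values for four made-up SNPs.
pairwise_r2 = {
    frozenset(("rs1", "rs2")): 0.90,
    frozenset(("rs1", "rs3")): 0.85,
    frozenset(("rs2", "rs3")): 0.50,
}
bins = greedy_ld_bins(["rs1", "rs2", "rs3", "rs4"], pairwise_r2)
```

Here `rs1` tags a bin of three SNPs, while `rs4` forms a singleton bin; genotyping one tag per bin captures the information in all bin members.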

SNP scoring and prioritization

SNP scoring: After retrieving SNP information, SNPselector scores each SNP in multiple categories.

  1. LD score: If the SNP is a tagging SNP of an LD bin, its LD score is assigned as the number of SNPs in the LD bin. Otherwise, its LD score is zero. The LD score reflects how informative the tagging SNP is. The higher the LD score, the more SNPs in the LD bin, the more SNP information can be assayed by the tagging SNP.
  2. Quality score: If the SNP has experimentally-verified genotyping or allele frequency information, it is considered as a “high-quality” SNP. Its quality score is 1. Otherwise, its quality score is 0.
  3. Function score: the SNP function score is based on the SNP type annotation from dbSNP. A higher score is assigned to the SNPs that might affect gene transcript structure or protein product, such as coding nonsynonymous SNPs or SNPs at a splicing site (Table 1).
    Table 1.
    SNP type and its function score
  4. Regulatory potential score: For each SNP not in an exonic region, SNPselector calculates its regulatory potential score based on its location within human / chimp / mouse / rat / dog / chicken / fugu / zebrafish conserved regions, a conserved transcription factor binding site (TFBS), a CpG island, or a microRNA gene (Table 2). These scores are summed into a single regulatory potential score, so a high score suggests high regulatory potential.
    Table 2.
    SNP type and its function score
  5. Repetitive score: If the SNP overlaps with a simple repeat region annotated by UCSC, its repetitive score is 1. Otherwise, its repetitive score is 0.
  6. Illumina pre-assay score: If SNP genotyping is to be performed with the Illumina bead platform, the user can upload an optional file containing the Illumina pre-assay score into the database. This score is calculated by the Illumina proprietary algorithm to assess the success rate of genotyping the SNP with this platform.
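The scoring rules in the list above can be sketched as a single function. The LD, quality, and repetitive scores follow the definitions given in the text. The function and regulatory weights below are placeholders only: the actual values belong to Table 1 and Table 2, which are not reproduced here, and the record layout (`SnpRecord`) is hypothetical. The optional Illumina pre-assay score is omitted because it is supplied by the user.

```python
from dataclasses import dataclass

# Placeholder weights (assumptions): the real values are in Table 1 and Table 2.
FUNCTION_SCORES = {
    "coding-nonsynonymous": 3, "splice-site": 3,
    "coding-synonymous": 2, "mrna-utr": 1, "intron": 0,
}
REGULATORY_SCORES = {
    "conserved_region": 1, "conserved_tfbs": 1, "cpg_island": 1, "microrna": 1,
}


@dataclass(frozen=True)
class SnpRecord:
    """Hypothetical record holding the annotations needed for scoring."""
    rs_id: str
    snp_type: str
    is_tag: bool = False
    ld_bin_size: int = 0
    has_genotype_or_frequency: bool = False
    is_exonic: bool = False
    regulatory_features: frozenset = frozenset()
    in_simple_repeat: bool = False


def score_snp(snp):
    """Return the per-category scores for one SNP."""
    return {
        # Tagging SNPs score the size of their LD bin; all others score zero.
        "ld": snp.ld_bin_size if snp.is_tag else 0,
        "quality": 1 if snp.has_genotype_or_frequency else 0,
        "function": FUNCTION_SCORES.get(snp.snp_type, 0),
        # Regulatory potential is only computed for non-exonic SNPs.
        "regulatory": 0 if snp.is_exonic else sum(
            REGULATORY_SCORES[f] for f in snp.regulatory_features),
        "repetitive": 1 if snp.in_simple_repeat else 0,
    }


example = SnpRecord("rs_example", "intron", is_tag=True, ld_bin_size=4,
                    has_genotype_or_frequency=True,
                    regulatory_features=frozenset({"cpg_island", "conserved_region"}))
scores = score_snp(example)
```

The placeholder table gives splice-site and coding-nonsynonymous SNPs the same weight, matching the statement in the Results that the two are prioritized at the same level.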

SNP prioritization: After searching and assigning scores to SNPs, SNPselector sorts SNPs by LD score in descending order so that the tag SNPs with the largest LD bins are at the top. SNPselector then sorts SNPs by quality score in descending order, followed by function score in descending order, so that SNPs with functional impact, such as nonsynonymous coding SNPs, rank higher than those with unknown function. SNPselector also sorts SNPs by regulatory potential score so that SNPs in conserved regions, CpG islands, or transcription factor binding sites rank higher than those outside these regions. Finally, SNPselector sorts SNPs by repetitive score in ascending order so that SNPs in non-repetitive regions rank higher than those in repetitive regions. Since the output is an easily manipulated spreadsheet, the user can re-sort the SNPs to highlight different SNP features. For example, a user who wants to find SNPs that might affect gene expression may choose to sort SNPs by regulatory potential score before sorting by function score.
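One way to read the prioritization above is as a single multi-key sort, with earlier criteria taking precedence over later ones. The sketch below follows that interpretation, which is an assumption; the score dictionaries and rs identifiers are made up.

```python
# (score name, sort descending?) in order of precedence, as described in the text.
DEFAULT_ORDER = (
    ("ld", True), ("quality", True), ("function", True),
    ("regulatory", True), ("repetitive", False),
)


def prioritize(scored_snps, order=DEFAULT_ORDER):
    """Sort score dictionaries lexicographically by the given criteria."""
    def sort_key(snp):
        # Negating a score turns an ascending sort into a descending one.
        return tuple(-snp[name] if descending else snp[name]
                     for name, descending in order)
    return sorted(scored_snps, key=sort_key)


snps = [
    {"id": "rsA", "ld": 3, "quality": 1, "function": 0, "regulatory": 2, "repetitive": 0},
    {"id": "rsB", "ld": 3, "quality": 1, "function": 2, "regulatory": 0, "repetitive": 0},
    {"id": "rsC", "ld": 0, "quality": 1, "function": 3, "regulatory": 0, "repetitive": 1},
    {"id": "rsD", "ld": 3, "quality": 0, "function": 3, "regulatory": 1, "repetitive": 0},
]

default_ranking = [s["id"] for s in prioritize(snps)]

# A user interested in gene expression could rank regulatory potential
# ahead of function score, as suggested in the text.
REGULATORY_FIRST = (
    ("ld", True), ("quality", True), ("regulatory", True),
    ("function", True), ("repetitive", False),
)
regulatory_ranking = [s["id"] for s in prioritize(snps, REGULATORY_FIRST)]
```

With the default order, `rsB` outranks `rsA` because of its higher function score; swapping the regulatory and function criteria reverses the two.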

SNP selection and data report

For SNPs that are queried by SNP accession ID, SNPselector selects and exports all the queried SNPs in one Excel spreadsheet file. For SNPs queried by gene name or gene location, SNPselector exports the top-ranked SNPs, up to the user-specified number per gene, into one Excel spreadsheet file. It also exports all SNPs available for each gene into a second gene-SNP spreadsheet. For genome-scan SNPs that are queried by genomic region, SNPselector selects evenly distributed SNPs at the user-specified spacing (in base pairs) and puts the result into one Excel spreadsheet file. In each gene or genome SNP Excel file, SNPselector generates a hyperlink called “DAS Link” in the first field of the first row. The link displays the LD bins and selected SNPs as custom tracks in the UCSC Genome Browser (Figure 2).
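The two selection modes can be sketched as follows. The per-gene cut-off follows directly from the text. The text does not say how evenly distributed SNPs are chosen, so the window-based rule below (keep the best-ranked SNP in each window of the requested spacing) is an assumption, as are the function names and the example data.

```python
from collections import defaultdict


def top_n_per_gene(ranked, n):
    """Keep the first n SNPs per gene from a priority-ordered list.

    ranked -- list of (gene, rs_id) tuples, best-ranked first
    """
    picked = defaultdict(list)
    for gene, rs_id in ranked:
        if len(picked[gene]) < n:
            picked[gene].append(rs_id)
    return dict(picked)


def evenly_spaced(ranked, region_start, spacing):
    """Keep the best-ranked SNP in each window of `spacing` base pairs.

    ranked -- list of (position, rs_id) tuples, best-ranked first
    Returns the selected SNPs in genomic order.
    """
    best = {}
    for pos, rs_id in ranked:
        window = (pos - region_start) // spacing
        # setdefault keeps the first (best-ranked) SNP seen in a window.
        best.setdefault(window, (pos, rs_id))
    return [best[w] for w in sorted(best)]


per_gene = top_n_per_gene(
    [("GENE_A", "rs1"), ("GENE_B", "rs2"), ("GENE_A", "rs3"), ("GENE_A", "rs4")],
    n=2,
)
scan = evenly_spaced(
    [(1500, "rs10"), (1200, "rs11"), (9000, "rs12"), (5200, "rs13")],
    region_start=1000,
    spacing=5000,
)
```

Both functions expect their input in priority order, so they would be applied after the SNPs have been scored and sorted.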

Fig. 2.
Display of LD bins and selected SNPs as custom tracks in the UCSC Genome Browser. (a): The “DAS Link” hyperlink in the SNP report Excel spreadsheet. (b): LD bins and selected SNPs are displayed as custom tracks in the UCSC Genome Browser. ...

RESULTS AND DISCUSSION

SNPselector was used to select 5 SNPs for each of 140 candidate genes for human cardiovascular disease (Seo et al., 2004). The 140 genes were widely distributed across the human genome, and SNPs had previously been manually selected from these genes. Among the 700 SNPs selected by SNPselector, all were high quality SNPs with allele frequency or genotyping data, and 582 (83%) were LD tagging SNPs.

Figure 3 shows the distribution of selected SNPs and total SNPs of the 140 genes in different functional categories. The majority of the SNPs were intronic. Through its prioritization rules, SNPselector decreased the percentage of intronic SNPs from 67.48% of the total available SNPs to 48.86% of the final selected SNPs. It also enriched for SNPs that might affect gene function. These included coding-nonsynonymous SNPs (enriched 7-fold), splice-site SNPs (enriched 14-fold), coding-synonymous SNPs (enriched 7-fold), and mRNA-UTR SNPs (enriched 2-fold). This enrichment of functional SNPs was similar to that achieved by the manual SNP selection process. In some categories, such as splice-site and mRNA-UTR, SNPselector outperformed manual selection. SNPselector prioritizes splice-site SNPs at the same level as coding-nonsynonymous SNPs, and chooses UTR SNPs located in conserved genomic regions.

Fig. 3.
The distribution of selected SNPs and total SNPs in different functional categories.

There are a few limitations to SNPselector as it is currently configured. The software requires SNP genotyping data to calculate tagging SNPs with ldSelect. To provide the richest possible dataset, we have merged HapMap genotyping data with other genotyping information, such as Perlegen genotyping data (Hinds et al., 2005). This approach generates a large number of LD bins when there is little overlap between the genotyped datasets. This will become less of an issue as HapMap genotyping progresses. SNPselector infers each SNP's impact on gene function from its location (e.g., coding region, promoter site, or UTR). However, SNPs located in these regions may not affect gene function. More detailed functional annotation will be added as it becomes available (Karchin et al., 2005).

In summary, SNPselector is a powerful tool for the identification of SNPs for large-scale genetic association studies. This software's output is comparable to that obtained from manual selection, but can be produced in a fraction of the time. The detailed descriptive output can be formatted for submission to a variety of commercial genotyping systems. SNPselector will be a valuable addition to many high-throughput SNP genotyping applications.

ACKNOWLEDGEMENTS

We would like to thank Carrie Browning, Liyong Wang, Jason Rose, and others for helpful suggestions. This work was supported by the following grants: P01 HL73042 (NHLBI); R01 AG021547, R01 NS36768, R01 NS31153 (NINDS); R01 AG19085 (NIA); R01 EY12012, R01 EY13315 (NEI).

REFERENCES

  • Birney E, et al. An overview of Ensembl. Genome Research. 2004;14:925–928.
  • Blanchette M, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Research. 2004;14:708–715.
  • Carlson CS, et al. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. American Journal of Human Genetics. 2004;74:106–120.
  • Colomb E, et al. Association of a single nucleotide polymorphism in the TIGR/MYOCILIN gene promoter with the severity of primary open-angle glaucoma. Clin. Genet. 2001;60:220–225.
  • Conde L, et al. PupaSNP Finder: a web tool for finding SNPs with putative effect at transcriptional level. Nucleic Acids Res. 2004;32:W242–W248.
  • Goldstein DB, et al. Pharmacogenetics goes genomic. Nat. Rev. Genet. 2003;4:937–947.
  • Hammer MF, et al. Hierarchical patterns of global human Y-chromosome diversity. Mol. Biol. Evol. 2001;18:1189–1203.
  • Hinds DA, et al. Whole-genome patterns of common DNA variation in three human populations. Science. 2005;307:1072–1079.
  • Karchin R, et al. LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics. 2005;21:2814–2820.
  • Karolchik D, et al. The UCSC Genome Browser Database. Nucleic Acids Res. 2003;31:51–54.
  • Kruglyak L, Nickerson DA. Variation is the spice of life. Nature Genetics. 2001;27:234–236.
  • Martin ER, et al. Association of single-nucleotide polymorphisms of the tau gene with late-onset Parkinson disease. Journal of the American Medical Association. 2001;286:2245–2250.
  • Riva A, Kohane IS. SNPper: retrieval and analysis of human SNPs. Bioinformatics. 2002;18:1681–1685.
  • Seo D, et al. Gene Expression Phenotypes of Atherosclerosis. Arterioscler. Thromb. Vasc. Biol. 2004;24:1922–1927.
  • Underhill PA, et al. Y chromosome sequence variation and the history of human populations. Nature Genetics. 2000;26:358–361.
  • Zhao T, et al. PromoLign: a database for upstream region analysis and SNPs. Human Mutation. 2004;23:534–539.
