Nucleic Acids Res. Jul 1, 2003; 31(13): 3746–3750.
PMCID: PMC168975

PROBEmer: a web-based software tool for selecting optimal DNA oligos

Abstract

PROBEmer (http://probemer.cs.loyola.edu) is a web-based software tool that enables a researcher to select optimal oligos for PCR applications and multiplex detection platforms including oligonucleotide microarrays and bead-based arrays. Given two groups of nucleic-acid sequences, a target group and a non-target group, the software identifies oligo sequences that occur in members of the target group, but not in the non-target group. To help predict potential cross hybridization, PROBEmer computes all near neighbors in the non-target group and displays their alignments. The software has been used to obtain genus-specific prokaryotic probes based on the 16S rRNA gene, gene-specific probes for expression analyses and PCR primers. In this paper, we describe how to use PROBEmer, the computational methods it employs, and experimental results for oligos identified by this software tool.

INTRODUCTION

In recent years there has been a tremendous increase in DNA sequence information, including whole genome sequences, publicly available gene databases [e.g. GenBank (1)] and clone libraries which have been sequenced by individual laboratories. This information is important for developing PCR primers and for multiplexed detection platforms like oligonucleotide microarrays (2) or bead-based methods (3) where oligonucleotide probes must be capable of discriminating among thousands of target sequences in the analyte. One application is to identify and quantify selected bacterial DNA sequences in a PCR product obtained from an environmental sample. Another application is to construct a suite of oligos each targeting a different gene in a eukaryotic organism.

This paper describes a software tool called PROBEmer that enables optimal oligos to be designed for a single target sequence or a group of target nucleic acid sequences. The target group (T-group) is a set of sequences for which a common substring is desired. For example, the T-group could consist of the set of 16S rRNA gene sequences for a microbial genus, while the non-target group (NT-group) would be all of the other sequences in a 16S rRNA database. PROBEmer finds all oligos of a specified length that exactly match a desired percentage of the T-group but do not match any sequences in the NT-group. The results are displayed in multiple ways, enabling the user to ascertain the number and position of single base mismatches with T- and NT-group sequences. This information is essential for designing oligonucleotides that minimize cross hybridization. PROBEmer possesses a web interface that allows the analysis to be performed on any platform with a JavaScript-capable internet browser.

PROBEmer differs from other probe-finding software in several important ways. Firstly, unlike the ARB package (http://www.arb-home.de), PROBEmer uses the suffix array data structure to locate common substrings in place of an initial multi-alignment of the database and is therefore not restricted to sequences sharing a high degree of similarity. Thus it is capable of finding probes for a target gene among all other genes in a genome. Secondly, because PROBEmer uses suffix tree-based algorithms instead of a BLAST database search (4,5), the user may supply any NT-group desired. Thirdly, because a sufficient understanding of free energy associated with solid phase DNA reactions (6) does not exist at this time to accurately predict experimental measurements of hybridization, PROBEmer focuses on string comparisons, not free energy calculations, to optimize the oligos.

The following sections describe the types of computations conducted by PROBEmer, the input that is required and the nature of the output. Experimental results are shown for probes targeting microbial genera. Oligos designed with PROBEmer are compared with literature sequences for Saccharomyces cerevisiae. Throughout this paper we use the terms ‘oligo’, ‘probe’ and ‘substring’ interchangeably but recognize there are distinctions based on specific applications of the oligonucleotide sequences.

COMPUTATIONAL METHODS

Overview

PROBEmer consists of three modules: Oligo Select, Oligo Check and Sequence Extract. Probe selection occurs within the Oligo Select module in six steps:

  1. The user specifies the T-group, the NT-group and various input parameters shown in Table 1. The T-group can be uploaded in FASTA format or produced by the Sequence Extract module using any of 13 databases currently internal to PROBEmer. The NT-group can be uploaded in FASTA format or produced from an internal database with keywords to exclude certain sequences.
  2. PROBEmer finds a set of probes common to a specified fraction of sequences in the T-group.
  3. These probes are filtered by GC content, position, self-homology and melting temperature.
  4. The probes are then compared to the NT-group (unless ‘None’ is specified). Probes that appear in too many members of the NT-group are eliminated.
  5. The candidate probes along with relevant information are displayed in a Results Window with hyperlinks to related pages.
  6. The researcher must navigate through the information to decide which probes are best for the application.

Table 1.
Inputs for the three modules of PROBEmer

A flowchart is shown in Figure 1. The goal is to provide a wide spectrum of data and a flexible interface to facilitate final selection of optimal probes by the user.

Figure 1
Flowchart showing major steps in PROBEmer.

Sequence Extract module and available databases in PROBEmer

The Sequence Extract module provides a method to extract sequences through a keyword database search of sequence comment lines. If multiple keywords are entered, all sequences that match any of the query words are returned (inclusive OR). The user may upload his own database in FASTA format or use any of 13 internal databases indexed within PROBEmer, including the Ribosomal Database Project (RDP) 16S rRNA database version 8.1 (7), the ORFs of ten microbial genomes from TIGR CMR (8) and the available gene sequences of two eukaryotic organisms obtained from GenBank. In the case of the RDP database, all redundant sequences and sequences <600 bp were removed to improve probe selection performance. The output of the search is a file in FASTA format that either can be sent to the web browser window or saved on the local hard drive.
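The keyword search described above can be sketched in a few lines. This is a minimal illustration, not PROBEmer's implementation: the function names `parse_fasta` and `extract_by_keywords` are hypothetical, and case-insensitive substring matching on the comment line is an assumption.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (comment_line, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def extract_by_keywords(text: str, keywords: Iterable[str]) -> List[Tuple[str, str]]:
    """Return records whose comment line contains ANY keyword (inclusive OR).

    Assumption: matching is a case-insensitive substring test.
    """
    wanted = [k.lower() for k in keywords]
    return [(h, s) for h, s in parse_fasta(text)
            if any(k in h.lower() for k in wanted)]
```

Because the test is a plain substring match, a keyword can also hit longer words that contain it, so the extracted set may need manual review.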

Probe selection

In the Oligo Select module (Table 1), the initial processing identifies oligos of a specified length that occur in the T-group. In practice, since it may not be possible to find an oligo that matches all sequences in the T-group, the user must specify the minimum percentage of T-sequences that each probe must match. The algorithm is a suffix array search (9,10) which enables common sequences to be found more quickly than a sequence multi-alignment and can be used on any set of sequences with conserved regions. The output from this stage is the initial list of probes.
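The idea behind this step can be illustrated with a simplified sketch. It sorts all suffixes of the T-group by their first k characters (a naive, k-truncated stand-in for a real suffix array, which would be built far more efficiently) and then scans runs of identical prefixes to count how many distinct target sequences share each oligo. The function name `common_oligos` and its parameters are hypothetical.

```python
from typing import Dict, List, Set


def common_oligos(targets: List[str], k: int, min_fraction: float) -> Dict[str, Set[int]]:
    """Return {oligo: indices of targets containing it} for every k-mer
    present in at least min_fraction of the target sequences."""
    # Represent each suffix as (sequence index, offset); only suffixes with
    # at least k characters can start an oligo of length k.
    suffixes = [(i, p) for i, seq in enumerate(targets)
                for p in range(len(seq) - k + 1)]
    # Naive construction: order suffixes by their first k characters.
    suffixes.sort(key=lambda ip: targets[ip[0]][ip[1]:ip[1] + k])

    needed = min_fraction * len(targets)
    result: Dict[str, Set[int]] = {}
    run_key, run_members = None, set()
    # Suffixes sharing the same k-prefix are now adjacent, so one linear
    # scan collects the set of sequences in which each oligo occurs.
    for i, p in suffixes:
        key = targets[i][p:p + k]
        if key != run_key:
            if run_key is not None and len(run_members) >= needed:
                result[run_key] = run_members
            run_key, run_members = key, set()
        run_members.add(i)
    if run_key is not None and len(run_members) >= needed:
        result[run_key] = run_members
    return result
```

The key property being used is that sorting places all occurrences of a substring next to each other, so no alignment of the sequences is required.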

The user may filter the list with additional criteria. The chemical characteristics available are GC content, melting temperature (Tm) and self-homology (complementarity); the self-homology filter uses the same heuristic as the PRIMER3 software (11). In addition, the user can specify the range of positions where the probes appear in the T-sequences.
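A rough sketch of the GC-content and Tm filters follows. The paper does not state which Tm formula PROBEmer uses, so the Wallace rule of thumb, Tm = 2(A+T) + 4(G+C), is substituted here purely as an assumption; the default ranges and all function names are hypothetical. The self-homology filter is not reproduced.

```python
from typing import Tuple


def gc_content(oligo: str) -> float:
    """Fraction of G and C bases in the oligo."""
    oligo = oligo.upper()
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


def wallace_tm(oligo: str) -> int:
    """Rule-of-thumb melting temperature in degrees C (Wallace rule).

    Assumption: used only for illustration; not necessarily PROBEmer's formula.
    """
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def passes_filters(oligo: str,
                   gc_range: Tuple[float, float] = (0.4, 0.6),
                   tm_range: Tuple[float, float] = (40, 60)) -> bool:
    """True if the oligo falls inside both (hypothetical) acceptance ranges."""
    return (gc_range[0] <= gc_content(oligo) <= gc_range[1]
            and tm_range[0] <= wallace_tm(oligo) <= tm_range[1])
```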

The probes are further filtered by searching for exact matches within the NT-group. This is done with the linear-time suffix tree streaming algorithm used within MUMmer 2.0 (12), slightly modified to keep track of the number of NT-group matches to each candidate probe. Any sequences with more exact matches than desired are eliminated from further analysis. The output is the final list of candidate probes.
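The effect of this filter can be shown with a deliberately simple stand-in. The sketch below uses Python's built-in substring test rather than the suffix tree streaming algorithm, so it reproduces the bookkeeping (counting NT-group sequences that contain each probe) but not the linear-time behavior. The function name and parameters are hypothetical.

```python
from typing import Dict, Iterable, List


def filter_by_nt_matches(probes: Iterable[str],
                         nt_group: List[str],
                         max_nt_matches: int = 0) -> Dict[str, int]:
    """Keep probes that occur exactly in at most max_nt_matches NT sequences.

    Returns {probe: number of NT sequences containing it} for kept probes.
    """
    kept: Dict[str, int] = {}
    for probe in probes:
        # Count sequences, not occurrences: a sequence containing the probe
        # twice still counts as one NT match.
        hits = sum(1 for seq in nt_group if probe in seq)
        if hits <= max_nt_matches:
            kept[probe] = hits
    return kept
```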

Oligo Select results window

Because the list of candidate probes may be long and the criteria for selecting a probe may be complex, the data are displayed with multiple features with hyperlinks to help the user select an optimal probe. The ‘Download as tab-delimited text’ option provides an easy method of exporting oligo results into a spreadsheet or other document format for further manipulation and long-term storage. Results may also be bookmarked and accessed for a short period following the initial computation.

An example is shown in Figure 2. The Sequence Extract module was used to construct a T-group from the 16S rRNA database with the keyword ‘Geobacter’ and the NT-group was all other 16S rRNA sequences. A minimum of 50% of the T-group sequences had to contain the probe, each of which could not match >20 sequences in the NT-group. The probes are displayed in a window either in alphabetical order or by average position in the T-sequences. Average position was implemented in PROBEmer to deal with inconsistent start locations within T-group sequences, a problem prevalent in, for example, 16S rRNA oligo selection. Because different subregions of 16S rRNA are in the database, the same substring will occur at varying positions within those subregions. Since it is impractical to output all locations on the summary results page, the mean is displayed to give at least some indication of the general location within this set. This value will be misleading, of course, when there is no clear consensus start location within the T-group. The list of all exact start locations may be viewed by clicking on the ‘Found in’ information.
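The ‘Found in’ and average-position summary can be sketched as follows. Two details are assumptions not stated in the text: positions are reported 1-based, and only the first occurrence in each target sequence is used. The function names are hypothetical.

```python
from typing import Dict, List, Optional


def start_positions(probe: str, targets: List[str]) -> Dict[int, int]:
    """Map target index -> 1-based start of the first exact occurrence."""
    found: Dict[int, int] = {}
    for i, seq in enumerate(targets):
        p = seq.find(probe)
        if p != -1:
            found[i] = p + 1
    return found


def summarize(probe: str, targets: List[str]) -> Dict[str, Optional[float]]:
    """Summary row: how many targets contain the probe and the mean start."""
    pos = start_positions(probe, targets)
    avg = sum(pos.values()) / len(pos) if pos else None
    return {"found_in": len(pos), "of": len(targets), "avg_pos": avg}
```

When the starts are widely scattered, as with partial gene sequences covering different subregions, the mean carries little information and the full list from `start_positions` is the better guide.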

Figure 2
Screenshot of Probe Results Window using Geobacter sequences extracted from RDP 8.1. The oligonucleotide sequences shown are substrings of the target sequences—the corresponding probe would be the reverse complement. The oligos are ordered alphabetically. ...
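Since the listed oligos are substrings of the target and the probe is their reverse complement, the conversion can be sketched as below. The function name is hypothetical and only unambiguous DNA bases (A, C, G, T) are handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(oligo: str) -> str:
    """Return the reverse complement of an unambiguous DNA string."""
    oligo = oligo.upper()
    if set(oligo) - set("ACGT"):
        raise ValueError("only A, C, G and T are supported")
    # Complement every base, then reverse the string.
    return oligo.translate(_COMPLEMENT)[::-1]
```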

For the first probe on the list, ‘Found in: 9 of 15’ indicates that the probe is present in nine of the 15 T-sequences; these sequences can be displayed by clicking on the hyperlink. For ‘Avg. Pos: 543’, the positions of the probe in each target sequence were averaged together. The numerical value is helpful when PCR primers are known and a probe needs to be located between the primers. ‘NT matches’ indicates the number of NT-sequences that contain the probe. For example, the fourth probe on the list matched 12 NT-sequences that can be studied by clicking on the hyperlink. If ‘0’ had been specified for the maximum number of NT-sequence matches, this column would have been omitted since all listed probes would have been unique.

n-off-match and Oligo Check module

Inexact matches (i.e. near neighbors) to each candidate probe may be evaluated by clicking the rightmost column in the Oligo Select results window (Fig. 2). The term ‘n-off-match’ refers to the Hamming string distance, i.e. the number of single base mismatches between the candidate probe and a sequence in the NT-group. By clicking on ‘4-off match check’, all NT-sequences with four or fewer base mismatches to the probe are shown (Fig. 3). Alternatively, the nearest neighbors may be displayed by clicking on ‘Obtain closest match distance’. This converts ‘4-off-match check’ to ‘d-off matches’ where d is the lowest string distance to a sequence within the NT-group. Clicking on ‘Sort by closest match distance’ will order the output by decreasing match distance rather than by numerical or alphabetical order. Since each mismatch may potentially reduce cross-hybridization, this allows a user to find oligos more easily with maximal string distance to the NT-group.
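The n-off-match definition can be made concrete with a brute-force sketch that slides the probe along each NT sequence and takes the smallest Hamming distance over all windows. This illustrates the definition only; it is not the search method PROBEmer uses, and the function names are hypothetical.

```python
from typing import List, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(probe: str, seq: str) -> Optional[int]:
    """Smallest Hamming distance between the probe and any window of seq."""
    m = len(probe)
    if len(seq) < m:
        return None
    return min(hamming(probe, seq[i:i + m]) for i in range(len(seq) - m + 1))


def n_off_matches(probe: str, nt_group: List[str], n: int) -> List[int]:
    """Indices of NT sequences with a window within n mismatches of the probe."""
    out = []
    for i, seq in enumerate(nt_group):
        d = min_distance(probe, seq)
        if d is not None and d <= n:
            out.append(i)
    return out


def closest_match_distance(probe: str, nt_group: List[str]) -> Optional[int]:
    """The value d in 'd-off matches': lowest distance to any NT sequence."""
    dists = [d for d in (min_distance(probe, s) for s in nt_group) if d is not None]
    return min(dists) if dists else None
```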

Figure 3
Screenshot of the Oligo Check and n-off-match results using a potential Geobacter probe. The user may query with either the probe or the target sequence. Neighbors with two or fewer mismatches are shown in the order they occur within the RDP v8.1 16S ...

In Figure 3, the oligo sequence is shown at the top. The accession number with hyperlink to the GenBank page, if available, and comment line for each neighbor are displayed. The alignment of the probe with the neighbor is shown where a ‘-’ indicates an exact base match. A letter indicates a mismatch. The numbers on the right refer to the position of the mismatch within the probe. This method of displaying the information enables the user to quickly ascertain the quality of the match, an important issue when designing probes to discriminate among highly similar sequences. When a probe is known and needs to be evaluated, the query sequence may be entered in the Oligo Check module with a specification of the maximum number of base mismatches of interest.
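The display convention described above can be imitated with a short sketch. The function name is hypothetical, and reporting mismatch positions as 1-based is an assumption.

```python
from typing import List, Tuple


def mismatch_display(probe: str, neighbor: str) -> Tuple[str, List[int]]:
    """Render a neighbor against the probe: '-' for a matching base,
    the neighbor's own letter for a mismatch, plus mismatch positions."""
    if len(probe) != len(neighbor):
        raise ValueError("probe and neighbor window must have equal length")
    line = "".join("-" if p == n else n for p, n in zip(probe, neighbor))
    positions = [i + 1 for i, (p, n) in enumerate(zip(probe, neighbor)) if p != n]
    return line, positions
```

For a probe `ACGTACGT` and a neighbor window `ACCTACGA`, the sketch yields the line `--C----A` with mismatches at positions 3 and 8.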

The algorithm for locating the nearest neighbor uses a greedy approximate search of the suffix tree data structure by searching for matches with successively greater numbers of mismatches (i.e. substitution errors). While searching for a match with k errors, all positions in the tree that could lead to a match with k+1 errors are saved. If no complete match with k errors is found, the search is resumed from the saved positions, saving for the next round the positions with k+2 errors, and so forth. Thus the first complete match found will have the minimum number of mismatches. For probes using modest nearest neighbor values (<6), our empirical results are that this algorithm is sublinear with respect to the size of the database used. Specific alignments are obtained by performing successive Tarhio-Ukkonen (13) searches on each sequence within the database. This algorithm is sublinear in the cases where the number of mismatches allowed is small or the probe is large in size.
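The round-by-round search can be sketched on a simplified index. Instead of a suffix tree, the code below builds a plain trie of every window of probe length in the database (an assumption made to keep the example short), and then explores it with exact matching while saving each mismatching branch for the next round. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Trie = Dict[str, dict]


def build_trie(sequences: List[str], m: int) -> Trie:
    """Trie of all length-m windows; a stand-in for a suffix tree."""
    root: Trie = {}
    for seq in sequences:
        for i in range(len(seq) - m + 1):
            node = root
            for ch in seq[i:i + m]:
                node = node.setdefault(ch, {})
    return root


def nearest_neighbor_distance(probe: str, trie: Trie) -> Optional[int]:
    """Minimum number of substitutions between the probe and any window."""
    if not trie:
        return None
    # frontier holds (node, depth) positions reachable with exactly k errors.
    frontier: List[Tuple[Trie, int]] = [(trie, 0)]
    k = 0
    while frontier:
        saved: List[Tuple[Trie, int]] = []  # positions that cost k + 1 errors
        stack = list(frontier)
        while stack:
            node, depth = stack.pop()
            if depth == len(probe):
                # First complete match found; no earlier round succeeded,
                # so k is the minimum number of mismatches.
                return k
            for ch, child in node.items():
                if ch == probe[depth]:
                    stack.append((child, depth + 1))   # free step
                else:
                    saved.append((child, depth + 1))   # costs one more error
        frontier = saved
        k += 1
    return None
```

Because each round resumes from the saved positions rather than restarting at the root, work done while searching for k errors is reused when searching for k+1.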

Platforms and web interface

PROBEmer was developed on a Linux workstation and was written in C. Currently PROBEmer is available at http://probemer.cs.loyola.edu and is running on a dual Pentium III 1.0 GHz machine with 4 GB of available memory. A processing queue has been implemented to allow more effective use of the software by multiple users.

The web-based interface was written in Perl and JavaScript and is intended to support a wide variety of computer platforms. The oligonucleotide selection process was streamlined based on the results of case studies of both experienced and inexperienced users of the PROBEmer interface. It has been successfully tested with the Netscape 4+ and Internet Explorer 5+ web browsers in multiple operating system environments. Documentation is available on the server web site through the main page.

TEST CASES

Probes for microbial genera and experimental test

Probes for two microbial genera, Geobacter and Desulfobacter, were designed with PROBEmer. For the first, a T-group was constructed from the RDP 16S rRNA database by the Sequence Extract module with the keyword ‘Geobacter’. Various input parameters in the Oligo Select module were tested; the best set of candidate probes was obtained with oligo length of 16 nt. Of these, probe 6020 (Table 2) appeared in 10 of 15 T-group sequences at an average position of 1407. (Because this T-group contained partial gene sequences, a probe may not be able to appear in all target sequences.) Probe 6020 had three NT-group matches of which two were unidentified clones. The nearest neighbor was a 3-off-match with an eight-base exact match. These probe characteristics were determined to be acceptable for experimental testing.

Table 2.
Experimental cross hybridization test comparing probes in the literature and probes developed with PROBEmer; the signals were normalized with respect to direct hybridization

For Desulfobacter, a similar procedure was applied. After using the keyword ‘Desulfobacter’ for the sequence extraction, manual editing of the T-group file was required to eliminate sequences with words such as ‘Thermodesulfobacterium’. The best probe 6019, at an average position of 633, matched 11 of 12 sequences in the T-group with no NT-group matches.
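The need for manual editing follows from substring matching: ‘Desulfobacter’ is contained in ‘Thermodesulfobacterium’. One possible way to screen such hits, shown here only as a sketch and not as a PROBEmer feature, is a whole-word test with a regular expression. The function name and the example comment lines are hypothetical.

```python
import re


def matches_whole_word(comment: str, keyword: str) -> bool:
    """True if keyword appears in comment as a complete word (case-insensitive)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, comment, flags=re.IGNORECASE) is not None


# Hypothetical comment lines used for illustration.
comments = ["Desulfobacter postgatei", "Thermodesulfobacterium commune"]
whole_word_hits = [c for c in comments if matches_whole_word(c, "Desulfobacter")]
substring_hits = [c for c in comments if "desulfobacter" in c.lower()]
```

Here `substring_hits` contains both comment lines, while `whole_word_hits` keeps only the first.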

The quality of probes 6019 and 6020 was tested by a multiplexed, flow cytometric, bead-based assay (3). Two probes based on the literature, labeled ‘3005’ (Geobacter) (14) and ‘6015’ (Desulfobacter) (15), were also compared. The analyte consisted of >400 fmol of amplicons from a single control strain obtained by PCR primers flanking various regions within the 16S rRNA gene. Large amplicon concentrations were used to observe cross hybridization, if present. The probes were subjected to multiple analytes, of which five are shown in Table 2 along with the PCR primers. For each probe, the data in each column were normalized such that the direct signal was set to 100%. Based on many tests with different control strains and primers, the direct signals varied between 960 and 3330 relative fluorescence units and were considered substantial in all cases.
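The normalization is a simple ratio: each signal is divided by the direct-hybridization signal for that probe and multiplied by 100. A minimal sketch follows; the function name and the numbers in the usage example are hypothetical and are not data from Table 2.

```python
from typing import Dict


def normalize_to_direct(signals: Dict[str, float], direct_key: str) -> Dict[str, float]:
    """Express every signal as a percentage of the direct-hybridization signal."""
    direct = signals[direct_key]
    if direct <= 0:
        raise ValueError("direct signal must be positive")
    return {name: 100.0 * value / direct for name, value in signals.items()}


# Hypothetical relative fluorescence units, for illustration only.
example = normalize_to_direct({"direct": 2000.0, "cross": 100.0}, "direct")
```

With these made-up inputs the cross-hybridization signal comes out at 5% of the direct signal.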

As shown in Table 2, the literature probe 3005 for Geobacter exhibited strong cross hybridization with Desulfotomaculum aeronauticum. The most likely cause, a 16-base exact match, was easily observed with the Oligo Check module. The redesigned probe 6020 exhibited negligible cross hybridization. The literature probe 6015 for Desulfobacter cross-hybridized with Desulfobulbus propionicus; by examining substrings of 6015, a 9-base exact match was found. The redesigned probe 6019 eliminated the cross hybridization.

Developing probes for a gene using a whole genome ORF database

Using PROBEmer, probes were found for the CDC28 gene in S. cerevisiae and were compared with results from Li and Stormo's ProbeSelect algorithm (6). In PROBEmer, the Sequence Extract module was used to extract the CDC28 sequence (897 nt) from the ORF database for S. cerevisiae (included in PROBEmer); the T-group consisted of a single sequence. The NT-group consisted of the ORF database with CDC28 excluded using a keyword. The oligo length was set to 24; the minimum target group match limit was 100%; the maximum number of non-target group matches was zero; the maximum complementarity was 20; otherwise default input parameters were used. All 10 probes selected by Li and Stormo with base composition criteria applied were found by PROBEmer. Six of these probes had a nearest neighbor with six mismatches; two had seven, and one had five. The remaining probe had a nearest neighbor with four mismatches and a 19-base exact match with another ORF. Based on our experimental experience, this probe would have a strong potential to cross hybridize.

PROBEmer additionally found 266 probes with six mismatches and 23 probes with seven mismatches to the nearest neighbor. Depending on the position of the mismatches, the latter may be superior in discriminating the ORFs. More stringent criteria may be applied to reduce the list.

DISCUSSION

PROBEmer has been used with success in both experimental and computational tests. Because it does not rely upon multi-alignments, it can be applied to sequences which are not similar (e.g. genes in a single organism) or to variations of a gene in many organisms (e.g. 16S rRNA gene). However, there are certain limitations. Firstly, probe sequences and PCR primers in the literature can contain ambiguous bases. PROBEmer was not designed to allow ambiguities in the candidate oligos. Secondly, when determining the distance of the neighbors, indels have not been included in the analysis.

ACKNOWLEDGEMENTS

We wish to thank S. Dollhopf, J. Kostka, J. Chang and D.C. White for the control strains and C. Adams for maintaining the computers. This work was supported by NSF grant IIS-9820497 to A.L.D. and by grants DOE NABIR DE-FG02-01ER63264 and NSF BES-0116765 to M.L.

REFERENCES

1. Benson D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2002) GenBank. Nucleic Acids Res., 30, 17–20.
2. Service R.F. (1998) Microchip arrays put DNA on the spot. Science, 282, 396–399.
3. Spiro A. and Lowe,M. (2002) Quantitation of DNA sequences in environmental PCR products by a multiplexed, bead-based method. Appl. Environ. Microbiol., 68, 1010–1013.
4. Nielsen H.B. and Knudsen,S. (2002) Avoiding cross hybridization by choosing nonredundant targets on cDNA arrays. Bioinformatics, 18, 321–322.
5. Xu D., Li,G., Wu,L., Zhou,Ji. and Xu,Y. (2002) PRIMEGENS: robust and efficient design of gene-specific probes for microarray analysis. Bioinformatics, 18, 1432–1437.
6. Li F. and Stormo,G.D. (2001) Selection of optimal DNA oligos for gene expression arrays. Bioinformatics, 17, 1067–1076.
7. Maidak B.L., Cole,J.R., Lilburn,T.G., Parker,C.T., Saxman,P.R., Farris,R.J., Garrity,G.M., Olsen,G.J., Schmidt,T.M. and Tiedje,J.M. (2001) The RDP-II (Ribosomal Database Project). Nucleic Acids Res., 29, 173–174.
8. Peterson J.D., Umayam,L.A., Dickinson,T.M., Hickey,E.K. and White,O. (2001) The Comprehensive Microbial Resource. Nucleic Acids Res., 29, 123–125.
9. Manber U. and Myers,E.W. (1993) Suffix arrays: a new method for on-line string searches. SIAM J. Comput., 22, 935–948.
10. Abouelhoda M.I., Kurtz,S. and Ohlebusch,E. (2002) The enhanced suffix array and its applications to genome analysis. Second Workshop on Algorithms in Bioinformatics, pp. 449–463. Springer-Verlag, Heidelberg, Germany.
11. Rozen S. and Skaletsky,H. (2000) Primer3 on the WWW for general users and for biologist programmers. In Krawetz S. and Misener S. (eds), Bioinformatics Methods and Protocols: Methods in Molecular Biology. Humana Press, Totowa, NJ, pp. 365–386.
12. Delcher A.L., Phillippy,A., Carlton,J. and Salzberg,S.L. (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res., 30, 2478–2483.
13. Tarhio J. and Ukkonen,E. (1993) Approximate Boyer-Moore string matching. SIAM J. Comput., 22, 243–260.
14. Snoeyenbos-West O.L., Nevin,K.P., Anderson,R.T. and Lovley,D.R. (2000) Enrichment of Geobacter species in response to stimulation of Fe(III) reduction in sandy aquifer sediments. Microb. Ecol., 39, 153–167.
15. Daly K., Sharp,R.J. and McCarthy,A.J. (2000) Development of oligonucleotide probes and PCR primers for detecting phylogenetic subgroups of sulfate-reducing bacteria. Microbiology, 146, 1693–1705.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press