The [numbers] refer to my bibliography.
Collaborators
The gapped local alignment score of two random sequences follows a modified Gumbel statistical distribution. If the distribution’s parameters could be estimated within one second, arbitrary alignment scoring schemes could improve the sensitivity of searching sequence databases (e.g., by accounting for query composition). Accordingly, with the aim of permitting arbitrary scoring schemes to be used in BLAST web services, many researchers have proposed various accelerations to the computation of modified Gumbel parameters. Before our work, at practical accuracies, the fastest method of calculating all the parameters took 2 days, and even the fastest method of calculating only the most important parameter (l) took 5 mins. We have prototype code that calculates all the modified Gumbel parameters to practical accuracies in less than 1 sec.
Research Plan: To implement practical working code for calculating the BLAST statistical parameters in less than 1 sec.
Collaborators
“Barcoding” identifies the species of an organism by standardizing and then sequencing genetic loci of less than 1 kb. The choice of barcodes is problematic in plants because they evolve more slowly than most other organisms. In response to requests from the Smithsonian Institute to provide bioinformatics support, I evaluated the performance of their primary candidate barcodes (e.g., matK, rbcL, and trnH-psbA) for identifying plant species with two metrics, sequence alignment and Kimura distance. First, we assembled a database consisting of GenBank sequences of the candidate plant barcode loci. Second, we calculated nearest neighbors in the sequence database under the various metrics. Third, by extracting each sequence from the database in turn and using its nearest neighbor's distance to classify it, we estimated the probability of a correct answer to the question: does another member of the extracted sequence's taxon remain in the database? We also estimated the probability of correct classification. Partly as a result of my presentation at the 2nd International Conference on Barcoding in Taipei 2007, trnH-psbA has been chosen as a plant barcode.
Research Plan: To develop and exploit new specializations of U-statistics to evaluate combinations of plant barcodes; and to develop (with possible spin-offs to Project 4) the multiple alignment strategies required to permit species identification with the trnH-psbA barcode.
Collaborators
The biological aim of this project was mainly the location of transcription binding sites (TFBSs), but its methods are quite general and apply to any ungapped sequence motif. With “gold” standard datasets and rigorous statistical tests, we showed that Markov models [80] and positional information [submitted manuscript] improve TFBS prediction significantly. Moreover, we showed that the Markov models used in extant TFBS programs is inferior, both theoretically and practically, to the theoretically correct Markov model we proposed. We also showed that positional information finds novel TFBS octonucleotide motifs [76, voted Best Paper at ISMB 2005 Conference] and that a single octonucleotide motif can have different biological functions, depending on its position relative to the transcription start site [submitted manuscript]. Positional information and the theoretically sound Markov models have been incorporated into the publicly available motif-finding program A-GLAM.
Research Plan: To continue to demonstrate the importance of position in elucidating the functional distinctions between identical sequence motifs, particularly ones participating in regulatory modules.
Collaborators
The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.
Research Plan: To implement heuristics in GLOBAL, to make its p-value practical for use in a PSI-BLAST-like computer program for iterative domain alignments, and to make GLOBAL a practical alternative to the local alignment rps-BLAST program for CDD searches at NCBI.
Collaborators
Computer analysis of biological sequences often detects deviations from a random model. In the usual model, sequence letters are chosen independently, according to some fixed distribution over the relevant alphabet. Real biological sequences often contain simple repeats, however, which can be broadly characterized as multiple contiguous copies (usually inexact) of a specific word. Inexact simple repeats can be quantified as local sums in a Markov additive process (MAP). The maximum of the local sums has an asymptotic Gumbel distribution, which are given by general MAP formulas. The general MAP formulas are usually computationally intractable, but an essential simplification in the case of repeats permits the Gumbel parameters to be computed from matrices whose dimension equals the size of the relevant alphabet. My results for ungapped repeats are more precise than those from G. Achaz et al. (2006) “Repseek, a tool to retrieve approximate repeats from large DNA sequences.” Bioinformatics. A working prototype of program for finding repeats with the Ruzzo-Tompa algorithm and then evaluating the results is available internally, within NCBI.
Research Plan: To seek opportunities to implement the theory in a fully practical computer program.
Last update: Jan 14, 2007
Comments or questions to John L. Spouge (spouge@ncbi.nlm.nih.gov)
Credits: John L. Spouge