Searching the Trace Archive with Discontiguous MegaBLAST

Modern sequencing technology has facilitated the production of a huge and growing volume of raw, unannotated nucleotide sequence for a variety of organisms. The rapidly expanding NCBI Trace Archive contains over 100 billion base pairs of such sequence from dozens of organisms. Making use of this unannotated sequence data requires the ability to compare these sequences to others, in particular, to the annotated sequences in the GenBank database. The sheer volume of data, however, makes it a challenge to perform sensitive comparisons quickly.

To maximize sensitivity when comparing coding sequences between organisms, translated searches are the best choice because they convert nucleotide sequences to their more tightly conserved protein translations before the comparisons are made. Because translations and comparisons must be performed in all six reading frames, however, translated searches are very time-consuming.
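The cost of a translated search comes from the six-frame translation step described above. The following is a minimal Python sketch of that step only, assuming the standard genetic code and an unambiguous A/C/G/T alphabet. The function names are illustrative and are not taken from any NCBI program.

```python
# Standard genetic code, written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate both strands at offsets 0, 1 and 2 (six frames in total)."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCC").items():
        print(name, protein)
```

Every query and every database sequence has to pass through this expansion before any comparison begins, which is why a translated search does several times the work of an untranslated one.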

Untranslated searches are more rapid but much less sensitive because codon usage differences between organisms allow similar proteins to be encoded by dissimilar nucleotide sequences. To facilitate sensitive untranslated searches, NCBI has developed a program called Cross-Species MegaBLAST.

MegaBLAST uses an exact contiguous nucleotide or “word” match of length 28 as the starting point for constructing alignments. However, as the identity between the sequences to be compared dips below 80%, the requirement for a contiguous word hit causes many statistically significant alignments to be missed, while many short random alignments are still generated. Cross-Species MegaBLAST works on the principle that if alignments are initiated not with an exact contiguous word match, but with a match at an equivalent number of noncontiguous positions within a longer segment of the sequence, fewer word hits are found, but a greater fraction of those found produce statistically significant alignments (1).
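To make the idea of a contiguous word hit concrete, here is a minimal Python sketch of exact-word seeding. It illustrates the general technique only and is not MegaBLAST's implementation. The function name `contiguous_seeds` is hypothetical, and the toy example uses a word size of 8 rather than 28 so that the sequences stay short.

```python
from collections import defaultdict


def contiguous_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where word_size
    consecutive bases are identical in both sequences."""
    # Index every word of the subject by its starting position.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)

    # Look up every word of the query in that index.
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    query = "ACGTACGGTCA"
    subject = "TTACGTACGGAA"
    print(contiguous_seeds(query, subject, word_size=8))  # [(0, 2)]
```

A single mismatch inside the window removes the hit entirely, so the longer the required word, the faster such seeds disappear as the two sequences diverge.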

An example of a discontiguous word is the “coding” template given below:


This template gives optimal results when comparing coding sequences across species. The positions occupied by the twelve “1”s must match between the two sequences before MegaBLAST will begin an alignment. The template allows every third position to vary, in accordance with the third, “wobble” base of the genetic code. This discontiguous word is more sensitive than the corresponding contiguous word of length 12 for comparisons of coding regions across species where sequence divergence is high.
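The effect of such a template can be sketched in a few lines of Python. The template string used here, `110110110110110110`, is an assumption chosen only because it fits the description above (twelve “1”s, with every third position free to vary). It is not necessarily the template the program uses, and the function names are likewise illustrative.

```python
# ASSUMPTION: an illustrative template with twelve 1s and every third
# position free; not necessarily the template used by the program.
CODING_TEMPLATE = "110110110110110110"


def template_matches(a, b, template):
    """True if a and b agree at every position marked '1' in template."""
    n = len(template)
    if len(a) < n or len(b) < n:
        return False
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")


def discontiguous_seeds(query, subject, template):
    """Naive scan for (query_pos, subject_pos) template hits."""
    n = len(template)
    return [
        (i, j)
        for i in range(len(query) - n + 1)
        for j in range(len(subject) - n + 1)
        if template_matches(query[i:i + n], subject[j:j + n], template)
    ]


def longest_exact_run(a, b):
    """Length of the longest run of identical aligned positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


if __name__ == "__main__":
    # Two made-up sequences that encode the same peptide (AEKLDP)
    # but differ at the third base of every codon.
    seq_a = "GCTGAAAAGCTGGATCCA"
    seq_b = "GCCGAGAAACTCGACCCT"
    print(longest_exact_run(seq_a, seq_b))                        # 2
    print(discontiguous_seeds(seq_a, seq_b, CODING_TEMPLATE))     # [(0, 0)]
```

In this toy case no contiguous word longer than two bases matches, so a contiguous seed of length 12 finds nothing, while the template still yields a hit because all twelve required positions agree. A practical implementation would index the masked words rather than scan every pair of positions.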

Figures 1 and 2 below show MegaBLAST graphical overviews for two searches of the mouse trace sequences using the transcript sequence of the human HEXA gene as the query. Both searches return results in a similar amount of time. However, the Cross-Species search is far more sensitive and detects matches over about 75% of the human HEXA sequence.

Cross-Species MegaBLAST is found as a search option on the Trace Archive page under the “Cross Species Comparison” link. Both the “coding” template and an “optimal” discontiguous template for searches using non-coding sequences are supported.

Figure 1: Graphical overview of the results of a MegaBLAST search using the transcript sequence of the human HEXA gene as a query against the mouse data in the Trace Archive.

Figure 2: Graphical overview of the results of a Cross-Species MegaBLAST search using the transcript sequence of the human HEXA gene as a query against the mouse data in the Trace Archive.

1. Ma, B., Tromp, J., Li, M. “PatternHunter: faster and more sensitive homology search.” Bioinformatics 2002 Mar;18(3):440-5.


NCBI News | Spring 2002 NCBI News