Data Sources
Non-Sequence-based Maps. Sources of maps that are not based directly on sequence include published maps in genetic, radiation hybrid, cytogenetic, and ordinal coordinate systems (where ordinal refers to clone order). The primary sources of each map are described in the online help documentation of each genome-specific Map Viewer. We are indebted to the researchers who make their mapping results so freely available. When a new version of any map becomes available, the data are also updated in the appropriate NCBI database.
Table 1
Types of Map Viewer annotation provided by NCBI
| STS | Sequence (Mb), Radiation hybrid (cRay), Genetic (cM), Clone content (ordinal), Cytogenetic | STS, STSnw, G3, GM4, GeneMap'99, TNG, Marshfield, Genethon, deCode, Whitehead YAC, phenotype maps such as Quantitative Trait Loci (QTL) |
| Clones | Sequence, Cytogenetic | Clone, BES, Components |
| Expression | Sequence | SAGE tag, UniGene |
| Genes | Sequence (Mb), Cytogenetic (band names) | Genes_seq, Genes_cyto |
| Gene-related | Sequence, Cytogenetic | UniGene, GenomeScan, Mitelman recurrent breakpoint, morbid |
| Variation | Sequence (Mb) | Variation |
| Published accessions | Sequence (Mb) | GenBank |
| Phenotype | Cytogenetic, Cytogenetic (abnormalities), Sequence | OMIM's morbid map, Mitelman's recurrent breakpoint, QTL (in progress) |
| Source clones | Sequence (Mb) | Component |
| Homology | Sequence (Mb) | Indirectly via LocusLink or UniGene. For mouse and human, through the homology (hm) link to the mouse–human homology map |
Sequence-based Maps. The sequence-based maps shown through Map Viewer can be supplied by external sources and/or supplied from features computed within NCBI. For example, when the annotated sequence for a complete genome is submitted to the sequence databases (GenBank/EMBL/DDBJ), a copy of the data may also be accessioned as Reference Sequences (RefSeqs; see Chapter 18). The gene, transcript, and other feature annotations of the submitted complete genome are processed for display in the Map Viewer. NCBI staff may then calculate and display the position of other types of features, such as marker position or points of variation, as separate maps (Table 1).
Table 2
NCBI data resources used in NCBI-generated annotation
| Clone Registry | Clone sequencing status, STS content, and availability |
| dbSNP | Single Nucleotide Polymorphisms (SNPs), polymorphisms, small-scale insertions/deletions, polymorphic repetitive elements |
| Genome Guides | Directory of key resources for the genome, with links to related resources and tutorials. The directory to guide pages is available from Genomic Biology. |
| LocusLink | Locus-specific data for a subset of organisms with extensive links to related resources and sequence data |
| OMIM | Human genes and Mendelian disorders |
| RefSeq | NCBI's curated, non-redundant RefSeqs |
| UniGene | Computed clusters of cDNA and Expressed Sequence Tag (EST) sequences from the same gene, with tissue expression information and links to related resources |
| UniSTS | Unified, nonredundant database of sequence tagged sites (STSs) |
Some of the annotation of genomic sequence carried out by NCBI is included in the genomic reference sequences (NC, NT, and NW Accession number format); however, other annotation is represented only in the Map Viewer and in the associated reports (Table 1). This latter type of annotation is based on information in several NCBI databases (Table 2) and is particularly important for attaching biological information to sequence data. Links to these resources are provided in Map Viewer to provide further information about each annotated object. It should be noted, however, that although sequence features may be placed in a genomic context automatically, there are curation steps that affect the final displays. For example, for the human and mouse genomes, sequences defining genes and pseudogenes are reviewed by collaborators and NCBI staff and, whenever possible, used as the basis of RefSeq records (NG, NM, and NR Accession number format).
Feature annotation is computed primarily in two ways: (1) by alignment of the defining sequence to the genome; or (2) for sequence tagged sites (STSs), by e-PCR (1). In some genomes, gene placement is based primarily on the alignment of mRNA [Expressed Sequence Tags (ESTs) and cDNAs], but only when an encoded protein is predicted. In other cases, where transcription evidence is weaker, more weight is given to identification of protein-coding regions. Gene identification is also constrained in that a known gene cannot be placed more than once in a haplotype (except for pseudoautosomal regions) or on an incorrect chromosome. Thus, if any reference haplotype retains inappropriately redundant sequence that encodes a gene, only one copy will be annotated as that gene. Others will be assigned interim IDs (see Chapter 14). Some ab initio methods may also be used for gene prediction. The predicted genes, as well as the mRNAs, are supplied as separate maps (gene, RNA, or GenomeScan maps).
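The single-placement constraint described above can be illustrated with a minimal sketch. The text does not say how the annotated copy is chosen, so the sketch assumes the placement with the highest alignment score wins. The function name, the record fields, and the `INTERIM_n` label format are all hypothetical and are not NCBI's actual identifiers or pipeline.

```python
def annotate_unique(placements):
    """Label at most one placement per known gene; give the rest interim labels.

    `placements` is a list of dicts with hypothetical keys:
    'gene' (known gene symbol), 'start' (position), 'score' (alignment score).
    ASSUMPTION: the highest-scoring copy keeps the gene name (ties: first seen).
    """
    # Find the best-scoring placement for each gene.
    best = {}
    for p in placements:
        current = best.get(p["gene"])
        if current is None or p["score"] > current["score"]:
            best[p["gene"]] = p

    # Keep the gene name on the best copy; assign interim labels to the others.
    annotated = []
    interim_counter = 0
    for p in placements:
        if best[p["gene"]] is p:
            annotated.append({**p, "label": p["gene"]})
        else:
            interim_counter += 1
            # Hypothetical interim label format, for illustration only.
            annotated.append({**p, "label": f"INTERIM_{interim_counter}"})
    return annotated


if __name__ == "__main__":
    example = [
        {"gene": "A", "start": 100, "score": 0.90},
        {"gene": "A", "start": 5000, "score": 0.99},
        {"gene": "B", "start": 300, "score": 0.80},
    ]
    for record in annotate_unique(example):
        print(record["start"], record["label"])
```

A real pipeline would also have to handle the pseudoautosomal exception and the chromosome check mentioned in the text; both are omitted here for brevity.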
In some cases, the position of these features may suggest the location of other genomic regions of interest. For example, the position of STS markers can help define the position of phenotypes such as quantitative trait loci (QTL). Although the best annotation of a gene or region is always through annotation by an expert researcher, automated annotation of genomes and comparison to that provided by experts can provide significant useful information. Experts interested in analyzing or assisting with genome annotation should contact us at info@ncbi.nlm.nih.gov.
Relationships among Coordinate Systems
In addition to supporting the display of multiple maps in the same coordinate system (e.g., multiple sequence-based maps), Map Viewer also displays maps in different coordinate systems by calculating the correspondences among them (e.g., sequence to genetic). This is accomplished by: (a) identifying features that have been placed on maps in different coordinate systems; and (b) using general conversion factors. In the first case, placement of STSs on the genome is critical for the integration of sequence data with other, non-sequence-based maps, such as genetic and RH maps. The integration of cytogenetic data with sequence data is achieved through alignment of sequence from clones that have been placed cytogenetically, such as the human fluorescence in situ hybridization (FISH)-mapped clones from the Bacterial Artificial Chromosome (BAC) Resource Consortium (2). The integration of non-sequence-based maps with the sequence provides a powerful mechanism to access portions of sequence on the basis of marker or cytogenetic data. Many features, such as Single Nucleotide Polymorphisms (SNPs), ESTs, mRNAs, whole genome shotgun reads, and clones can be placed on the genome assembly by using standard DNA sequence alignment methods such as BLAST.
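As an illustration of approaches (a) and (b), the following sketch converts a genetic position (cM) to a sequence position (Mb). It interpolates linearly between markers placed on both maps and falls back to a caller-supplied general conversion factor outside the anchored range. Linear interpolation is an assumption made here for illustration; the text does not specify how Map Viewer computes correspondences, and the function and parameter names are hypothetical.

```python
from bisect import bisect_right


def genetic_to_sequence(cm, anchors, mb_per_cm=None):
    """Estimate a sequence position (Mb) for a genetic position (cM).

    `anchors` is a list of (cM, Mb) pairs for markers placed on both maps,
    sorted by cM. `mb_per_cm` is an optional general conversion factor used
    only outside the range covered by the anchors.
    """
    if len(anchors) < 2:
        raise ValueError("at least two shared markers are required")

    cm_values = [c for c, _ in anchors]

    # Outside the anchored range: approach (b), a general conversion factor.
    if cm < cm_values[0] or cm > cm_values[-1]:
        if mb_per_cm is None:
            raise ValueError("position is outside the anchored range")
        ref_cm, ref_mb = anchors[0] if cm < cm_values[0] else anchors[-1]
        return ref_mb + (cm - ref_cm) * mb_per_cm

    # Inside the range: approach (a), interpolate between flanking markers.
    i = min(bisect_right(cm_values, cm) - 1, len(anchors) - 2)
    (cm0, mb0), (cm1, mb1) = anchors[i], anchors[i + 1]
    if cm1 == cm0:
        return mb0
    return mb0 + (cm - cm0) * (mb1 - mb0) / (cm1 - cm0)


if __name__ == "__main__":
    shared_markers = [(0.0, 1.0), (10.0, 5.0), (20.0, 25.0)]  # made-up values
    print(genetic_to_sequence(5.0, shared_markers))
    print(genetic_to_sequence(25.0, shared_markers, mb_per_cm=1.0))
```

The anchor values are invented for the example. Because the ratio of physical to genetic distance varies along a chromosome, local interpolation between shared markers is generally preferable to a single genome-wide factor wherever anchors are available.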
The identification of known genes within the genome assembly provides critical landmarks and functional context to the sequence data, which in turn makes it easier to traverse to other rich sources of gene and protein information, including publications, OMIM, RefSeq, Conserved Domain Database (CDD), and LocusLink.
The power of calculating correspondences between coordinate systems may be more apparent when considering a common application of Map Viewer, i.e., identifying candidate genes within a region defined by genetic markers. When markers are placed on both genetic and sequence maps, it is then possible to use the gene-related maps (gene, UniGene/EST, or ab initio predictions) to identify possible genes of interest. For more details on how to do this, see the Map Viewer Exercises in Chapter 23.
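The candidate-gene lookup described above reduces to an interval query once the flanking markers have sequence coordinates. The sketch below assumes marker positions and gene intervals are already available as plain Python data. All names and coordinates are hypothetical, and it does not call any Map Viewer interface.

```python
def candidate_genes(marker_a, marker_b, marker_positions, genes):
    """Return genes overlapping the sequence interval between two markers.

    `marker_positions` maps marker name -> sequence position (bp).
    `genes` is a list of (name, start, end) tuples in the same coordinates.
    """
    # Order the two marker positions so the interval is well defined.
    low, high = sorted((marker_positions[marker_a], marker_positions[marker_b]))
    # A gene is a candidate if any part of it overlaps [low, high].
    return [name for name, start, end in genes if start <= high and end >= low]


if __name__ == "__main__":
    # Hypothetical marker and gene coordinates for illustration only.
    positions = {"MARKER_A": 1_200_000, "MARKER_B": 2_500_000}
    gene_map = [
        ("geneX", 900_000, 1_100_000),    # entirely before the interval
        ("geneY", 1_150_000, 1_250_000),  # overlaps the left boundary
        ("geneZ", 2_000_000, 2_100_000),  # inside the interval
    ]
    print(candidate_genes("MARKER_B", "MARKER_A", positions, gene_map))
```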
A Work in Progress
Figure 1. Evaluation of a chromosome sequence (STS) map
Potential inconsistencies in the order or orientation of sequence blocks can be investigated by displaying a genetic map (Marshfield), radiation hybrid map (TNG), and sequence map (STS) together and checking the Show connections box in the Maps & Options window. Note that some of the gray lines (connecting the same marker on different maps) are crossed, indicating that either the placement is incorrect on a map or the chromosome sequence is not ordered and oriented consistently with all map data.
Figure 2. Evaluation of gene localization and annotation
A comparison of cDNA alignments (UniGene, RNA) and gene predictions (GenomeScan) to the genomic contig annotation can be achieved by displaying three maps simultaneously. The genomic contig (NT_024981.9) annotation is shown in the Genes_seq map and is displayed with the GenomeScan predictions (the GScan map) and the EST/mRNA alignments labeled by human UniGene clusters (the UniG_Hs map). Note that in this case, there are two sequence objects not included in the contig annotation: one is an ab initio prediction (the last model in the GScan map) (a); and the other is either some small gene or an alternative 3′ exon for PIK3C3 from the UniG_Hs map (b). This approach is especially useful when reviewing BLAST results in a genomic context.
For many genomes, identifying and positioning chromosomes and genes within sequence blocks is an ongoing process. In those cases, the Map Viewer can be used to evaluate the evidence that supports the current representation of the sequence and visualize possible conflicts. Inconsistencies in map order or in the placement of any object can be seen in the Map Viewer; this is assisted in some cases by the use of color coding (Figures 1 and 2).
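The crossed connecting lines that signal an order conflict between two maps can also be detected programmatically. The sketch below reports every pair of shared markers whose relative order differs between two maps. It is an illustrative check under assumed data structures (marker name mapped to position), not a description of how Map Viewer draws or evaluates connections.

```python
from itertools import combinations


def crossed_pairs(map_a, map_b):
    """Return pairs of shared markers whose order differs between two maps.

    `map_a` and `map_b` map marker name -> position in that map's own
    coordinate system (e.g., cM, cR, or Mb). Units need not match, because
    only the relative order of each marker pair is compared.
    """
    shared = [marker for marker in map_a if marker in map_b]
    crossed = []
    for m1, m2 in combinations(shared, 2):
        delta_a = map_a[m1] - map_a[m2]
        delta_b = map_b[m1] - map_b[m2]
        # Opposite signs mean the pair is inverted, so its lines would cross.
        if delta_a * delta_b < 0:
            crossed.append((m1, m2))
    return crossed


if __name__ == "__main__":
    # Hypothetical marker names and positions.
    genetic = {"m1": 1.0, "m2": 2.0, "m3": 3.0}
    sequence = {"m1": 10.0, "m2": 30.0, "m3": 20.0}
    print(crossed_pairs(genetic, sequence))
```

A crossed pair does not say which map is wrong. As the Figure 1 legend notes, either a marker placement is incorrect or the sequence is not ordered and oriented consistently with the map data.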
For some genomes, the color-coded contig map displays whether the annotation is based on sequence assembled from draft or finished clones (blue, finished; green, whole genome shotgun; orange, draft). This is helpful when evaluating the level of confidence in the completeness of the annotation of a gene and/or its coding region.
Figure 3. Representation of ambiguity
(a) The marker D1S2894 is found on several maps. Note that for the first map (STS), the circle is diagonally split with two colors. The diagonal means that the marker has been placed more than once; the two colors mean that the placements are not on the same chromosome. (b) A Map Viewer display of a region of chromosome 16. SNPs that are placed more than once on the chromosome are designated by a yellow triangle. From the Contig map, it appears that at least one of these SNPs (rs3220808) is placed both on draft sequence (orange) and on finished sequence (blue). This may be an artifact resulting from misassembly or perhaps a region of segmental duplication. This diagram also illustrates the use of color to indicate the source and level of confidence in annotated genes. Blue indicates a confirmed gene with no conflicts; light green indicates EST evidence only; dark brown indicates a GenomeScan prediction with protein homology; orange means that there is a conflict between the annotated gene and the mRNA evidence. (Ab initio predictions from GenomeScan are categorized into two types, based on presence or absence of sequence similarity to vertebrate proteins or protein domains.)
Map Viewer also uses color coding or diagrams to represent the level of confidence in the placement of any mapped object. For example, SNPs or STSs that are placed at more than one position in a given map are noted by color (yellow) in the detailed labels (Figure 3). Annotated genes are shown in different colors, based on the source and level of confidence in the annotation or the model (Figure 3).
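The distinction drawn in Figure 3 between a marker placed once, placed more than once on one chromosome, and placed on more than one chromosome can be expressed as a simple classification. The sketch below is illustrative only: the status strings and input layout are hypothetical and do not correspond to Map Viewer's internal flags.

```python
from collections import defaultdict


def classify_placements(placements):
    """Classify each marker by how ambiguously it is placed.

    `placements` is a list of (marker, chromosome, position) tuples.
    Returns a dict mapping marker -> one of three hypothetical status strings.
    """
    hits_by_marker = defaultdict(list)
    for marker, chromosome, position in placements:
        hits_by_marker[marker].append((chromosome, position))

    status = {}
    for marker, hits in hits_by_marker.items():
        chromosomes = {chromosome for chromosome, _ in hits}
        if len(hits) == 1:
            status[marker] = "unique"
        elif len(chromosomes) == 1:
            status[marker] = "multiple_same_chromosome"
        else:
            status[marker] = "multiple_different_chromosomes"
    return status


if __name__ == "__main__":
    # Hypothetical marker names and coordinates.
    example = [
        ("sts1", "1", 1_000),
        ("sts2", "16", 5_000),
        ("sts2", "16", 9_000),
        ("sts3", "1", 2_000),
        ("sts3", "7", 4_000),
    ]
    print(classify_placements(example))
```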