| Pan troglodytes - chimpanzee genome data and search tips | Revised May 2, 2007 |
|
The Map Viewer help document describes how to use the Map Viewer software. This page describes the data available for Pan troglodytes (chimpanzee) and the search tips specific to that organism. You can also return to the Pan troglodytes genome view search page. The Map Viewer home page allows you to search the genome data of any organism represented in Map Viewer. |
|
|
|
The Map Viewer provides a view of chimpanzee data from sequence maps, described below. Separate documents provide an introduction to the information infrastructure developed at NCBI to integrate the various types of data generated by the Chimpanzee Genome Project, including release notes for each build of the genome, and statistics for the current build. |
| Chimpanzee Genomic Sequence Data |
|
|
The current chimpanzee genome build is a composite of whole genome shotgun sequences from clones isolated from Yerkes chimp pedigree #C0471 (Clint). Funding for the sequencing project was provided by NHGRI. The initial assembly was made public and subsequently aligned to the human genome on December 10, 2003, as announced by NHGRI.
The Pan troglodytes whole genome shotgun (WGS) projects were submitted by two groups: the Genome Sequencing Center at the Washington University School of Medicine in St. Louis and The Broad Institute. The Washington University WGS project has the project accession AACZ00000000 and consists of accessions AACZ01000001-AACZ01435593. The Broad Institute WGS project has the project accession AADA00000000 and consists of accessions AADA01000001-AADA01361864.
On March 16, 2006, the Chimpanzee Sequencing and Analysis Consortium released a new 6X WGS draft assembly. This assembly consists of accessions AACZ02000001-AACZ02246375 and includes sequence from the initial 4X assembly along with an additional 2X of WGS plasmid reads generated by the Genome Sequencing Center at the Washington University School of Medicine. The assembly currently displayed and annotated in Map Viewer is the 6X assembly from the Chimpanzee Sequencing and Analysis Consortium. Contigs were assembled using the human genome as a guide, and are therefore "humanized" in their construction. This is an important distinction, because some sequences, such as insertions, deletions, and gene duplications, may not be accurately represented by the current chimpanzee assembly.
We have adopted the new chimpanzee chromosome naming system as proposed by McConkey, 2004 and endorsed by the International Chimpanzee Genome Consortium. This system renames the chimpanzee chromosomes to correspond to their syntenic human chromosomes, which makes comparisons between the chimpanzee and human genomes easier to understand. The new chimpanzee chromosome names, compared to the human chromosome names and the original chimpanzee chromosome names, can be seen in this table.
Watanabe et al., 2004 published the sequence of chimpanzee chromosome 22. Their sequence assembly, ICC22Cv1 (The International Chimpanzee Chromosome 22 Consortium chromosome 21 assembly version 1), was included in build 1 as an alternate assembly. In build 2 it is included in the reference assembly as chromosome 21 and replaces the corresponding WGS sequences of the previous assembly (build 1.1).
A 5-Mb region of chromosome 7 was finished by the Genome Sequencing Center at the Washington University School of Medicine in collaboration with David Page's group at the Whitehead Institute. This finished region replaces the corresponding WGS sequences of the previous assembly (build 1.1).
Kuroki et al., 2006 published the sequence of chimpanzee chromosome Y. The assembly of this sequence, CCYSCv1 (The Chimpanzee Chromosome Y Sequencing Consortium assembly version 1), was included in build 2 as an alternate assembly. This chromosome Y alternate assembly sequence came from a single male chimpanzee (Gon) of the subspecies Pan troglodytes verus, while the chromosome Y reference assembly sequence came from a single male captive-born chimpanzee (Clint) of the species Pan troglodytes.
chr6_hla_hap1 is an alternate assembly of the chimpanzee chromosome 6 HLA region that aligns better to an alternate haplotype of the human chromosome 6 HLA region than to the HLA region of the human reference assembly. |
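The WGS accession ranges quoted above can be checked programmatically. The following is a minimal sketch, not an NCBI tool: it assumes the accessions follow the layout visible in the text (four letters, a two-digit assembly version, and a six-digit serial number), and the function names and range labels are hypothetical.

```python
import re

# Accession ranges quoted in the text; the dictionary keys are illustrative labels.
WGS_RANGES = {
    "Washington University 4X (AACZ01)": ("AACZ01000001", "AACZ01435593"),
    "Broad Institute 4X (AADA01)": ("AADA01000001", "AADA01361864"),
    "Consortium 6X (AACZ02)": ("AACZ02000001", "AACZ02246375"),
}

# Assumed layout: 4-letter project prefix, 2-digit version, 6-digit serial.
_WGS_PATTERN = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession):
    """Split an accession such as 'AACZ02000001' into (prefix, version, serial)."""
    match = _WGS_PATTERN.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a 4+2+6 WGS accession: {accession!r}")
    prefix, version, serial = match.groups()
    return prefix, int(version), int(serial)


def find_wgs_project(accession):
    """Return the label of the quoted range that contains the accession, or None."""
    prefix, version, serial = parse_wgs_accession(accession)
    for label, (first, last) in WGS_RANGES.items():
        first_prefix, first_version, first_serial = parse_wgs_accession(first)
        _, _, last_serial = parse_wgs_accession(last)
        same_project = (prefix, version) == (first_prefix, first_version)
        if same_project and first_serial <= serial <= last_serial:
            return label
    return None
```

Accessions that carry a sequence version suffix (for example a trailing ".1") would need the suffix stripped before calling these helpers.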
| Chimpanzee BLAST Databases |
|
| The complete set of chimpanzee sequence databases available for BLAST searching is shown on the Chimpanzee BLAST page, which includes a link to the database descriptions. |
| Additional Chimpanzee Genome Resources |
|
| In addition to the Pan troglodytes data available in the Map Viewer and through BLAST, information is available from the Chimpanzee Genome Resources Guide, which includes links to NCBI resources and external resources pertaining to genomic sequence, maps and annotation. An explanation of the Gnomon gene prediction processing can be found here. In addition, the NCBI Handbook includes a series of exercises that demonstrate additional questions that can be answered with Map Viewer. |
|
|
| Sequence-Based Maps |
|
| Ab initio | Shows models generated by Gnomon. mRNA alignments were used to segment the genomic sequence by putative gene boundaries, and Gnomon was executed on these segments to predict genes. Gnomon uses protein alignments in addition to transcript alignments and, in order to capture as much coding information in the genome as possible in this assembly, Gnomon models may represent partial as well as complete coding sequences. Models with a completely supported CDS are blue, models with a partially supported CDS are green, and the pure ab initio predictions are brown. Those ab initio predictions with e values <0.0001 are indicated as dark brown on the map and all other ab initio predictions are shown in light brown. Pure ab initio status indicates that the model was built without the support of mRNA or protein alignments, either through failure to align the sequence to the genome or an alignment ignored by Gnomon due to a score falling below a pre-determined threshold. |
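The color legend for Gnomon models described above can be summarized as a small lookup. This is only an illustrative sketch of the legend: the function name and the support labels ("complete", "partial", "none") are hypothetical and are not Map Viewer identifiers.

```python
def gnomon_model_color(cds_support, e_value=None):
    """Return the display color described in the legend for a Gnomon model.

    cds_support is a hypothetical label: 'complete', 'partial', or 'none'
    ('none' meaning a pure ab initio prediction).
    """
    if cds_support == "complete":
        return "blue"
    if cds_support == "partial":
        return "green"
    if cds_support == "none":
        # Pure ab initio: dark brown only when the e value is below 0.0001.
        if e_value is not None and e_value < 1e-4:
            return "dark brown"
        return "light brown"
    raise ValueError(f"unknown support label: {cds_support!r}")
```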
| Assembly | Allows users to visualize all sequence data available for a given region of the genome, and separates data by assembly. There are currently three assemblies available for chimpanzee:
When viewing the assembly map, a blue vertical line indicates the assembly that is being viewed. The reference assembly is shown as the blue line by default. The orange vertical line shows a region of the genome where sequence data from other assemblies is available. By default, the other sequence maps that are available for an organism display features that have been annotated on the reference assembly. The Maps&Options dialog box allows you to change the assembly being displayed. Instructions on how to do this are provided in the Select One or More Assemblies to Display section of the general Map Viewer help document. |
| Component | The component map provides the tiling path of GenBank accessions used to build each "NT_xxxxxx" contig, and the tiling path of GenBank "AACZ02xxxxxx" accessions from the Pan troglodytes whole genome shotgun project (AACZ00000000.2) used to build the NW_xxxxxx WGS contigs, which are described below. |
| Contig | Shows the chromosomal placement of NW_xxxxxx contigs on the chimpanzee assembly of whole genome shotgun (WGS) data. The individual GenBank records used to assemble the contigs are shown on the Component map, described above. |
| CpG Island | Shows regions of high G + C content on the assembled genome sequence. Two sets of criteria were used for finding CpG islands: "strict" and "relaxed," described below. The algorithm (and cutoffs) were taken from Takai and Jones, 2002.
|
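CpG island criteria of this kind are typically expressed as a minimum length, a minimum G + C fraction, and a minimum observed/expected CpG ratio. The sketch below computes those quantities for a single sequence window. It assumes the commonly used observed/expected formula (number of CpG dinucleotides times the window length, divided by the product of the C and G counts); the function names are hypothetical, and the cutoffs are left as required parameters because the specific "strict" and "relaxed" values are not reproduced here.

```python
def cpg_stats(sequence):
    """Return (gc_fraction, observed_over_expected_cpg) for a DNA sequence."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c_count + g_count) / length
    if c_count == 0 or g_count == 0:
        return gc_fraction, 0.0
    obs_exp = (cpg_count * length) / (c_count * g_count)
    return gc_fraction, obs_exp


def passes_cpg_criteria(sequence, min_length, min_gc, min_obs_exp):
    """Check one window against caller-supplied cutoffs."""
    if len(sequence) < min_length:
        return False
    gc_fraction, obs_exp = cpg_stats(sequence)
    return gc_fraction >= min_gc and obs_exp >= min_obs_exp
```

A real island finder would slide such a window along the contig and merge adjacent passing windows; that step is omitted here.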
| GenBank_DNA | Shows the placement of chimpanzee genomic DNA sequences from GenBank that were not used in the assembly of contigs. The placement is based on the alignment of the sequences to the components of the contigs. It includes genomic sequences longer than 500 bp that have at least 97% identity to the components for at least 98 base pairs. If a sequence extends beyond a contig, that portion of sequence is not shown. The 'hits' link leads to a tabular display that shows the matching regions (base spans) of the assembly component and the GenBank genomic DNA record that has been aligned to it. Orange lines represent unfinished (phase 1 and 2) HTG sequences that have been aligned to the assembled genome. Blue lines represent other genomic DNA records that have been aligned to the assembled genome. The length of a line spans the uppermost and lowermost points on the genome assembly to which sequence fragments from a single GenBank record were aligned. Thick parts of a line represent fragments of sequence from a GenBank record that have been aligned to the assembled genomic sequence, and the thin parts of a line connect the fragments that come from a single GenBank record. When the GenBank_DNA map is displayed as the master map, in the default verbose mode, the descriptive text includes a bases column, which shows the total number of bases in the GenBank record that was aligned to the genome, and a status column, which shows the total number of bases from that record that were aligned to the genome, how many separate pieces of sequence from that record were aligned, and whether those pieces were shuffled to make the alignment. |
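The inclusion thresholds stated for the GenBank_DNA map (longer than 500 bp, at least 97% identity, over at least 98 base pairs) can be written as a simple filter. This is a sketch under stated assumptions: the `Alignment` record, its field names, and the example accessions are hypothetical, and the real pipeline may apply additional rules.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one GenBank record aligned to contig components."""
    accession: str
    sequence_length: int      # length of the GenBank genomic sequence, in bp
    percent_identity: float   # identity of the aligned region, in percent
    aligned_bases: int        # number of base pairs covered by the alignment


def shown_on_genbank_dna_map(aln, min_length=500, min_identity=97.0, min_aligned=98):
    """Apply the thresholds quoted in the text to a single alignment."""
    return (
        aln.sequence_length > min_length
        and aln.percent_identity >= min_identity
        and aln.aligned_bases >= min_aligned
    )


# Example with made-up records.
candidates = [
    Alignment("EXAMPLE001", 1200, 98.5, 400),
    Alignment("EXAMPLE002", 450, 99.0, 300),   # too short
    Alignment("EXAMPLE003", 2000, 96.0, 900),  # identity too low
]
kept = [aln.accession for aln in candidates if shown_on_genbank_dna_map(aln)]
```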
| Genes_Sequence | Genes that have been annotated on the genomic contigs. This includes known and putative genes placed as a result of alignments of mRNAs to the contigs. If multiple models exist for a single gene, corresponding to splicing variants, the Genes_Sequence map presents a flattened view of all the exons that can be spliced together in various ways. For example, if one splice variant uses exons 1, 3, 4, and another splice variant uses exons 2, 3, 4, the Genes_Sequence map shows exons 1, 2, 3, 4. Genes shown on the left of the grey line are transcribed in the - orientation (from bottom up), and those on the right in the + orientation (from top down). When Genes_Sequence is selected as the master map, the verbose display (detailed labeling, shown by default) includes arrows to the right of each gene name that indicate its direction of transcription, as well as links to:
Gene models are shown in five colors, depending on the type of evidence that was used to construct the models. The one- or two-letter code shown in the evidence column (which is displayed when Genes_Sequence is the master map) also indicates the type of evidence. |
|
|
Additional Notes: In general, a gene model is shown in blue if there is a clean alignment between a RefSeq or GenBank mRNA sequence and the genomic sequence, and if there is an exact match between the protein product that was annotated in the mRNA sequence record and the conceptual translation of the genomic sequence gene model. A gene model is shown in orange if there is some discrepancy between the mRNA sequence and the gene model, either in the alignment of the two or in their protein products, or both. Examples of the former include gaps, or the alignment of an mRNA to two or more genomic regions. Examples of the latter include differences between the amino acid sequence given in an mRNA sequence record and the conceptual translation of the corresponding gene model, or premature termination of a coding region in the genomic sequence. Both kinds of discrepancy can be caused by base pair mismatches between the mRNA and genomic sequence. Models with Interim LocusIDs (evidence code I) may be paralogs, genes not yet curated, duplications caused by assembly errors, or pseudogenes. The genome assembly and annotation pipeline assigns interim IDs when there is no unambiguous solution to what they should be. Interim LocusIDs are always associated with RefSeq XM_* accessions (model mRNAs), although supporting alignments may (or may not) include RefSeq NM_* accessions (known mRNAs). More about RefSeq and RefSeq accessions... |
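The flattened exon view described for the Genes_Sequence map amounts to taking the union of the exons of all splice variants. The sketch below illustrates the idea with exons given as (start, end) coordinate pairs. The function name and coordinates are hypothetical, and merging partially overlapping exons into one interval is an assumption, not documented Map Viewer behavior.

```python
def flatten_exons(splice_variants):
    """Union the exons of all variants into a sorted, non-overlapping list.

    splice_variants is a list of variants; each variant is a list of
    (start, end) exon intervals with start <= end.
    """
    intervals = sorted({exon for variant in splice_variants for exon in variant})
    flattened = []
    for start, end in intervals:
        if flattened and start <= flattened[-1][1]:
            # Overlaps the previous interval: extend it (assumed behavior).
            prev_start, prev_end = flattened[-1]
            flattened[-1] = (prev_start, max(prev_end, end))
        else:
            flattened.append((start, end))
    return flattened


# The example from the text: one variant uses exons 1, 3, 4 and another uses 2, 3, 4.
exon1, exon2, exon3, exon4 = (100, 200), (300, 400), (500, 600), (700, 800)
variant_a = [exon1, exon3, exon4]
variant_b = [exon2, exon3, exon4]
flat = flatten_exons([variant_a, variant_b])  # exons 1, 2, 3, 4 in order
```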
| RefSeq RNA | Diagrams of the RNAs that are predicted on the genomic contigs. The RefSeq RNA map and Genes_Sequence map are built in the same way; however, the Genes_Sequence map shows a flattened view of all the exons in a gene, while the RefSeq RNA map can show the individual combinations of exons (i.e., splice variants) that are valid, if mRNA sequences indicate alternative splice variants. |
| Repeats | Position of repetitive elements.
RepeatMasker was used to identify regions of the genome that contain interspersed repeats and low-complexity DNA sequences. |
| RNA Maps | The RNA maps show mRNA and EST sequences from a given organism aligned to the assembled chimpanzee genomic sequence that has been repeat-masked and dusted. Only ESTs supplied with orientation are used. Each alignment is the single best placement for that sequence in the current build of the chimpanzee genome. The maps can be queried by sequence accession. The RNA maps include:
The display for RNA maps differs from those labeled as Xx_UniGene in that what are displayed here are the alignments [thicker lines] and putative introns [thinner lines] of ESTs and longer mRNAs best placed at that position. Green lines indicate ESTs; blue indicates cDNAs. In contrast, the "UniGene" map is a summary of probable splicing events, with connections to UniGene for the clusters that contain those sequences. |
| STS | Placement of STSs from a variety of sources onto the assembled genomic sequence (the NW_xxxxxx contigs, described above) using Electronic-PCR (e-PCR). |
| UniGene Maps |
The UniGene maps show mRNA and EST sequences from a given organism aligned to the assembled genomic sequence that has been repeat-masked and dusted. Only ESTs supplied with orientation are used. Each alignment is the single best placement for that sequence in the current build of the genome. ESTs are clustered based on shared introns and alignment to a common position on the genome. Those ESTs can come from one or more UniGene clusters, whose IDs are noted by the EST cluster. (UniGene clusters are made with a different build procedure, so there is not necessarily a one-to-one correspondence between EST clusters on the UniGene map and clusters in the UniGene resource.) The display of the UniGene map varies according to the span of sequence being displayed. For large spans of sequence (greater than 10 million bases), the Map Viewer displays histograms that show the density of ESTs and mRNAs aligned to a region, the UniGene clusters to which they belong, and the number of sequences from each UniGene cluster. For smaller spans of sequence (i.e., higher resolutions, showing less than 10 million bases), the Map Viewer displays the above information plus blue lines that indicate exon/intron structure:
Alignments are grouped by common structure. If two or more transcripts share at least one intron/exon splice junction, the alignments of those transcripts are merged into a single model. If two or more transcripts do not share any intron/exon splice junction, they are shown as separate models. The UniGene maps include:
|
|
|
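The grouping rule for UniGene map alignments (transcripts that share at least one splice junction are merged into one model) can be sketched with a small union-find structure. This is an illustration only: the function name, transcript names, and junction coordinates are hypothetical, and treating the merge as transitive (A shares with B, B shares with C, so all three form one model) is an assumption.

```python
def group_by_shared_junction(transcripts):
    """Group transcripts that share at least one splice junction.

    transcripts maps a transcript name to a set of splice junctions, each a
    hashable value such as an (intron_start, intron_end) pair.
    Returns a sorted list of sorted name lists, one list per merged model.
    """
    parent = {name: name for name in transcripts}

    def find(name):
        # Follow parent links to the group representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_seen = {}  # junction -> first transcript observed with that junction
    for name, junctions in transcripts.items():
        for junction in junctions:
            if junction in first_seen:
                parent[find(name)] = find(first_seen[junction])
            else:
                first_seen[junction] = name

    groups = {}
    for name in transcripts:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in groups.values())


# Made-up example: est1-est3 are linked through shared junctions.
models = group_by_shared_junction({
    "est1": {(200, 300)},
    "est2": {(200, 300), (400, 500)},
    "est3": {(400, 500)},
    "est4": {(900, 1000)},
    "est5": set(),  # unspliced alignment, shares no junction
})
```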
| Searchable Terms |
|
The Map Viewer supports searching on any term that describes an element on any map, including:
|
| Map Positions |
|
As noted in the Search By Position section of the Entrez Map Viewer general help document, there are three main ways to search by map position from the Map View of a chromosome:
It is not necessary to enter a value in both Region text boxes. If you enter a value only in the upper box, the Map Viewer will display the region of the chromosome starting from that point and ending at the lower end of the chromosome. If you enter a value only in the lower box, the Map Viewer will display the region of the chromosome starting at the upper end of the chromosome and ending at the value entered.
|
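The behavior of the two Region text boxes can be expressed as a small function that fills in whichever bound is left empty. This is a sketch of the rule as described above, not Map Viewer code: the function name is hypothetical, and the 1-based base-pair coordinates and the handling of two empty boxes (the whole chromosome) are assumptions.

```python
def resolve_region(chromosome_length, upper=None, lower=None):
    """Return the (start, end) region implied by the two Region boxes.

    upper is the value in the upper box and lower the value in the lower box;
    None means the box was left empty.
    """
    start = 1 if upper is None else upper
    end = chromosome_length if lower is None else lower
    if not 1 <= start <= end <= chromosome_length:
        raise ValueError(f"invalid region {start}-{end} for length {chromosome_length}")
    return start, end
```

For a chromosome of 1,000 units, `resolve_region(1000, upper=200)` yields the region from 200 to the lower end of the chromosome, and `resolve_region(1000, lower=300)` yields the region from the upper end to 300.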