Updated assemblies, incorporating new data, filling in existing gaps, and increasing overall accuracy, will be released to the public on a regular basis. The human genome data can be viewed on the Web with NCBI's human genome Map Viewer or downloaded in bulk via FTP.
NCBI's assembly process starts with the entire complement of human
genomic sequence in GenBank, both draft and finished. Assembling and
ordering the individual sequence units is a critical phase of the Human
Genome Project. It involves many different steps, including screening
for vector and other sequence contamination before merging the
input data into ordered segments of DNA referred to as contigs. This
first build presents more than 6,000 contigs, representing roughly
2.8 billion base pairs. Nearly 700 contigs are longer than 1 Mb. Over
75 percent of the bases in the contigs are in unbroken segments
greater than 30 kb, the size of a typical human gene.
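As a rough illustration of how summary figures like these could be derived, the following Python sketch counts contigs above a length cutoff and computes the fraction of bases lying in unbroken segments above a size threshold. It is not NCBI's pipeline. It assumes, purely for illustration, that each contig is available as a plain string and that gaps inside a contig are written as runs of "N". The function names and the toy data are hypothetical.

```python
import re


def unbroken_segments(sequence):
    """Return the lengths of maximal runs that contain no gap character.

    Assumption: gaps inside a contig are encoded as runs of 'N' (or 'n').
    """
    return [len(run) for run in re.findall(r"[^Nn]+", sequence)]


def summarize_contigs(contigs, long_contig=1_000_000, min_segment=30_000):
    """Summarize a list of contig sequences.

    long_contig: contigs longer than this are counted as "long" (1 Mb by default).
    min_segment: unbroken segments longer than this count toward the fraction
                 (30 kb by default).
    """
    segments = [n for seq in contigs for n in unbroken_segments(seq)]
    total_bases = sum(segments)  # gap characters are not counted as bases
    bases_in_long_segments = sum(n for n in segments if n > min_segment)
    return {
        "contigs": len(contigs),
        "total_bases": total_bases,
        "long_contigs": sum(1 for seq in contigs if len(seq) > long_contig),
        "fraction_in_long_segments": (
            bases_in_long_segments / total_bases if total_bases else 0.0
        ),
    }


# Toy data with tiny thresholds so the example runs instantly.
toy = ["ACGTACGTAC", "ACGNNNACGTAC", "ACG"]
stats = summarize_contigs(toy, long_contig=9, min_segment=5)
print(stats)
```

With real data the default thresholds (1 Mb and 30 kb) would apply. The toy call lowers them only so that the short example strings produce non-trivial counts.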
NCBI is also engaged in the essential process of annotating, or labeling, the biologically important areas of the human genomic sequence. Human gene annotation falls into two major tasks: the correct placement of known human genes into their proper genomic context, and the prediction of new, previously unknown genes from the genomic sequence.
For the first task, the mRNAs from the NCBI RefSeq collection are placed on the genome primarily by alignment, with compensation for various problems in both the genomic and mRNA sequences and reconciliation of close paralogs and pseudogenes. In this first release on the NCBI Web site, 8,800 of the 10,500 RefSeq mRNAs were placed on the genome.
For the second task, multiple lines of evidence, including EST alignments, splice junctions, protein similarities, and other methods, are combined to predict new genes. The predicted mRNAs and proteins will be subject to change with improved data and better algorithms. Nonetheless, NCBI will do its best to keep the same accession numbers with the same predicted genes from build to build. A new release containing both known gene placements and predicted gene models was in process as this article went to press.
Biological features are also being annotated on the genomic sequence.
This first release includes more than 1.3 million SNPs and 111,851
NCBI's human genome Map Viewer may be used to view the contigs used to assemble the sequence by selecting Contig map. SNP data may be viewed on the SNP map. The Map Viewer may be used to further explore the human genome data by viewing up to 7 parallel maps selected from a palette of nineteen, including 6 sequence maps, 5 cytogenetic maps, 2 genetic maps, and 6 radiation hybrid maps.
The data is also available for downloading from the genomes/H_sapiens directory of the NCBI FTP site.
The FTP site includes the contigs produced by the NCBI assembly, RefSeq
and model mRNA sequences annotated on the genome, and information used
by the Map Viewer to generate and display the palette of nineteen maps
mentioned above. DW