Data & Software
Databases
- NCBI Website Search
- A database of static NCBI web pages, documentation, and online tools. These pages include such content as specialized online sequence analysis tools, back issues of newsletters, legacy resource description pages, sample code, and other miscellaneous resources.
Tools
- ASN.1 Format Summary
- A data representation format standardized by the International Organization for Standardization (ISO) and used to achieve interoperability between platforms.
- CDTree
- A stand-alone application for classifying protein sequences and investigating their evolutionary relationships. CDTree can import, analyze and update existing Conserved Domain (CDD) records and hierarchies, and also allows users to create their own. CDTree is tightly integrated with Entrez CDD and Cn3D, and allows users to create and update protein domain alignments.
- Cn3D
- A stand-alone application for viewing 3-dimensional structures from NCBI's Entrez retrieval service. Cn3D runs on Windows, Macintosh, and UNIX and can be configured to receive data from most popular web browsers. Cn3D simultaneously displays structure, sequence, and alignment, and has powerful annotation and alignment editing features.
- E-Utilities
- Tools that provide access to data within NCBI's Entrez system outside of the regular web query interface. They provide a method of automating Entrez tasks within software applications. Each utility performs a specialized retrieval task, and can be used simply by writing a specially formatted URL.
- NCBI DTDs
- A listing of all the DTDs that NCBI uses.
- NCBI Toolbox
- A set of software and data exchange specifications used by NCBI to produce portable, modular software for molecular biology. The software in the Toolbox is primarily designed to read records in Abstract Syntax Notation 1 (ASN.1) format, a data representation format standardized by the International Organization for Standardization (ISO).
- ProSplign
- A utility for computing alignments of proteins to genomic nucleotide sequences. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, ProSplign is accurate in determining splice sites and tolerant to sequencing errors.
- Splign
- A utility for computing cDNA-to-Genomic sequence alignments. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, Splign is accurate in determining splice sites and tolerant to sequencing errors.
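The E-Utilities entry above notes that each utility can be called by writing a specially formatted URL. The following is a minimal sketch of that idea in Python. It assumes the publicly documented base address `https://eutils.ncbi.nlm.nih.gov/entrez/eutils`, the `esearch.fcgi` utility, and its `db`, `term`, and `retmax` parameters; none of these appear in this listing, and the helper name `build_esearch_url` is hypothetical. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed base address of the E-Utilities service; confirm against
# NCBI's own E-Utilities documentation before relying on it.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for the given Entrez database and query term.

    Hypothetical helper for illustration: it only formats the URL.
    """
    # urlencode escapes spaces and special characters in the query term.
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


if __name__ == "__main__":
    # Build (but do not send) a search of PubMed for a two-word term.
    print(build_esearch_url("pubmed", "splign alignment", retmax=5))
```

The resulting string could be pasted into a browser or passed to any HTTP client, which is what lets Entrez tasks be automated from within software applications.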
Downloads
- BLAST (Stand-alone)
- BLAST executables for local use are provided for IRIX 6.2, Solaris 2.6, DEC OSF1 (ver. 4.0d), LINUX, and Win32 systems. See the README file in the FTP directory for more information. Pre-formatted databases for BLAST nucleotide, protein, and translated searches are also available for download.
- Cn3D Installation Page
- A stand-alone application for viewing 3-dimensional structures from NCBI's Entrez retrieval service. Cn3D runs on Windows, Macintosh, and UNIX and can be configured to receive data from most popular web browsers. Cn3D simultaneously displays structure, sequence, and alignment, and has powerful annotation and alignment editing features.
- FTP: BLAST Databases
- Sequence databases for use with the stand-alone BLAST programs. The files in this directory are pre-formatted databases that are ready to use with BLAST.
- FTP: CDD
- This site provides full data records for CDD, along with individual Position Specific Scoring Matrices (PSSMs), mFASTA sequences and annotation data for each conserved domain. See the README file for full details.
- FTP: COG
- This site contains data from the COG database. See the readme file for information about the content and organization of the files.
- FTP: dbGaP Open-Access Data
- Open-access data generally include summaries of genotype/phenotype association studies, descriptions of the measured variables, and study documents, such as the protocol and questionnaires. Access to individual-level data, including phenotypic data tables and genotypes, requires varying levels of authorization.
- FTP: dbMHC Data
- This site contains data in separate directories for the various projects and resources within the database of the human major histocompatibility complex (dbMHC).
- FTP: FASTA BLAST Databases
- Sequence databases in FASTA format for use with the stand-alone BLAST programs. These databases must be formatted using formatdb before they can be used with BLAST.
- FTP: GenBank
- This site contains files for all sequence records in GenBank in the default flat file format. The files are organized by GenBank division, and the full contents are described in the README.genbank file.
- FTP: Gene
- This site contains three directories: DATA, GeneRIF and tools. The DATA directory contains files listing all data linked to GeneIDs along with subdirectories containing ASN.1 data for the Gene records. The GeneRIF (Gene References into Function) directory contains PubMed identifiers for articles describing the function of a single gene or interactions between products of two genes. Sample programs for manipulating gene data are provided in the tools directory. Please see the README file for details.
- FTP: Gene Expression Nervous System Atlas (GENSAT)
- This site contains GENSAT image data organized by gene and contributing institution.
- FTP: Gene Expression Omnibus (GEO) Profiles and Datasets
- This site contains GEO data in two formats: SOFT (Simple Omnibus in Text Format) and MINiML (MIAME Notation in Markup Language). Summary text files and supplementary data are also available. Please see the README.TXT file for more information.
- FTP: Genome
- This site contains genome sequence and mapping data for organisms in Entrez Genome. The data are organized in directories for single species or groups of species. Mapping data are collected in the directory MapView and are organized by species. See the README file in the root directory and the README files in the species subdirectories for detailed information.
- FTP: Genome Markers (UniSTS)
- This directory contains text and XML files for UniSTS records along with mapping data.
- FTP: GenPept
- The protein sequences corresponding to the translations of coding sequences (CDS) in GenBank are collected for each GenBank release. Please see the README file in the directory for more information.
- FTP: GENSAT
- This site contains GENSAT image data organized by gene and contributing institution.
- FTP: HomoloGene
- This site contains data for each build of HomoloGene, beginning with build 35. Complete data for each build are provided in XML, and a data summary is provided in tab-delimited text format.
- FTP: Mapping Data
- This site contains a directory for each genome, holding the available mapping data for current and previous builds of that genome.
- FTP: NCBI Taxonomy
- This site contains the full taxonomy database along with files associating nucleotide and protein sequence records with their taxonomy IDs. See the taxdump_readme.txt and gi_taxid.readme files for more information.
- FTP: Protein Clusters
- This site contains data from the Protein Clusters database arranged by release date. See the README files for more information.
- FTP: PubChem
- This site provides data from the PubChem Substance, Compound and BioAssay databases for download via FTP. Full downloads of the databases are available along with daily, weekly and monthly updates for Substance and Compound. Substance and Compound data are provided in ASN.1, SDF and XML formats. See the README files for more information.
- FTP: RefSeq
- This site contains all nucleotide and protein sequence records in the Reference Sequence (RefSeq) collection. The "release" directory contains the most current release of the complete collection, while data for selected organisms (such as human, mouse and rat) are available in separate directories. Data are available in FASTA and flat file formats. See the README file for details.
- FTP: Sequence Read Archive Download Facility
- This site contains next-generation sequencing data organized by the submitted sequencing project.
- FTP: Site
- FTP download site for NCBI databases, tools, and utilities.
- FTP: SKY/M-Fish and CGH Data
- This site contains SKY-CGH data in ASN.1, XML and EasySKYCGH formats. See the skycghreadme.txt file for more information.
- FTP: Structure (MMDB)
- This site contains ASN.1 data for all records in MMDB along with VAST alignment data and the non-redundant PDB (nr-PDB) data sets. See the README file for more information.
- FTP: Trace Archive
- This site contains trace data organized by species. Data include trace and quality data in FASTA format, along with ancillary information in tab-delimited text and XML. See the README file for details.
- FTP: UniGene
- This site contains individual directories for each organism with data in UniGene. The data for each species include the unique sequence for each UniGene cluster, all sequences in each cluster in FASTA format, and library information for the cluster. See the README file for further details.
- FTP: UniVec
- This site contains the UniVec and UniVec_Core databases in FASTA format. See the README.uv file for details.
- FTP: Whole Genome Shotgun Sequences
- This site contains whole genome shotgun sequence data organized by the 4-digit project code. Data include GenBank and GenPept flat files, quality scores and summary statistics. See the README.genbank.wgs file for more information.
- RSS Feeds
- Subscribe to Web/RSS feeds for updates about NCBI resources.
Submissions
- tbl2asn
- A command-line program that automates the creation of sequence records for submission to GenBank using many of the same functions as Sequin. It is used primarily for submission of complete genomes and large batches of sequences.