Often used in conjunction with Cn3D is the Vector Alignment Search Tool (VAST; Refs. 3, 4). VAST is used to precompute “structure neighbors”, that is, structures similar to each MMDB entry. For users who have a set of 3D coordinates for a protein not yet in MMDB, there is also a VAST search service. The output of the precomputed VAST searches is a list of structure records, each representing one of the Non-Redundant PDB chain sets (nr-PDB), which can also be downloaded. Four clustered subsets of MMDB compose nr-PDB, each consisting of clusters having a preset level of sequence similarity.
The structures within MMDB are now being linked to the NCBI Taxonomy database (Chapter 4). Known as the PDBeast project, this effort makes it possible to find: (1) all MMDB structures from a particular organism; and (2) all structures within a node of the taxonomy tree (such as lizards or Bacillus), which launches the Taxonomy Browser showing the number of MMDB records in each node.
The second database within the Structure resources is the Conserved Domain Database (CDD; Ref. 5), originally based largely on Pfam and SMART, collections of alignments that represent functional domains conserved across evolution. CDD now also contains the alignments of the NCBI COG database and the NCBI Library of Ancient Domains (LOAD), along with new curated alignments assembled at NCBI. CDD can be searched from the CDD page in several ways, including by a domain keyword search. Three tools have been developed to assist in analysis of CDD: (1) the CD-Search, which uses a BLAST-based algorithm to search the position-specific scoring matrices (PSSMs) of CDD alignments; (2) the CD-Browser, which provides a graphic display of domains of interest, along with the sequence alignment; and (3) the Conserved Domain Architecture Retrieval Tool (CDART), which searches for proteins with similar domain architectures.
All the above databases and tools are discussed in more detail in other parts of this Chapter, including tips on how to make the best use of them.
To build MMDB (1), 3D structure data are retrieved from the PDB database (6) administered by the Research Collaboratory for Structural Bioinformatics (RCSB). In all cases, the structures in MMDB have been determined by experimental methods, primarily X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy. Theoretical structure models are omitted. The data in each record are then checked for agreement between the atomic coordinates and the primary sequence, and the sequence data are then extracted from the coordinate set. The resulting agreement between sequence and structure allows the record to be linked efficiently into searches and alignment displays involving other NCBI databases.
The data are converted into ASN.1 (7), which can be parsed easily and can also accept numerous annotations to the structure data. In contrast to a PDB record, an MMDB record in ASN.1 contains all necessary bonding information in addition to sequence information, allowing consistent display of the 3D structure using Cn3D. The annotations provided in the PDB record by the submitting authors are added, along with uniformly defined secondary structure and domain features. These features support structure-based similarity searches using VAST. Finally, two coordinate subsets are added to the record: one containing only backbone atoms, and one representing a single-conformer model in cases where multiple conformations or structures were present in the PDB record. Both of these additions further simplify viewing both an individual structure and its alignments with structure neighbors in Cn3D. When this process is complete, the record is assigned a unique Accession number, the MMDB-ID (Box 1), while also retaining the original four-character PDB code.
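The two coordinate subsets described above can be illustrated with a minimal Python sketch. The `Atom` type, the function names, and the choice of N, CA, C, and O as the protein backbone atoms are assumptions made for illustration; the text does not specify how MMDB selects backbone atoms or which conformer it keeps, so this sketch simply keeps the first position listed for each atom.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    """Hypothetical, simplified atom record (not the MMDB ASN.1 layout)."""
    chain: str
    residue_number: int
    atom_name: str
    alt_loc: str  # "" when the atom has only one recorded conformation
    x: float
    y: float
    z: float


# Assumption: conventional protein backbone atom names.
BACKBONE_NAMES = {"N", "CA", "C", "O"}


def backbone_subset(atoms: List[Atom]) -> List[Atom]:
    """Return only the atoms whose names are in BACKBONE_NAMES."""
    return [a for a in atoms if a.atom_name in BACKBONE_NAMES]


def single_conformer_subset(atoms: List[Atom]) -> List[Atom]:
    """Keep one position per atom: the first alternate location encountered."""
    seen = set()
    kept = []
    for a in atoms:
        key = (a.chain, a.residue_number, a.atom_name)
        if key in seen:
            continue  # a later alternate conformation of an atom already kept
        seen.add(key)
        kept.append(a)
    return kept
```

Applying both filters in sequence would give a single-conformer backbone trace, which is the kind of reduced coordinate set that makes a large alignment quicker to draw.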
After initial processing, 3D domains are automatically identified within each MMDB record. 3D domains are annotations on individual MMDB structures that define the boundaries of compact substructures contained within them. In this way, they are similar to secondary structure annotations that define the boundaries of helical or β-strand substructures. Because proteins are often similar at the level of domains, VAST compares each 3D domain to every other one and to complete polypeptide chains. The results are stored in Entrez as a Related 3D Domain link.
To identify 3D domains within a polypeptide chain, MMDB's domain parser searches for one or more breakpoints in the structure. These breakpoints fall between major secondary structure elements such that the ratio of intra- to interdomain contacts remains above a set threshold. The 3D domains identified in this way provide a means to both increase the sensitivity of structure neighbor calculations and also present 3D superpositions based on compact domains as well as on complete polypeptide chains. They are not intended to represent domains identified by comparative sequence and structure analysis, nor do they represent modules that recur in related proteins, although there is often good agreement between domain boundaries identified by these methods.
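The breakpoint criterion can be sketched as follows. This is a simplified, single-breakpoint illustration and not MMDB's actual domain parser: the contact list, the candidate breakpoints (assumed to lie between secondary structure elements), the threshold value, and both function names are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def contact_ratio(contacts: Iterable[Tuple[int, int]], breakpoint: int) -> float:
    """Ratio of intra- to interdomain contacts when the chain is split
    into residues [0, breakpoint) and [breakpoint, end)."""
    intra = inter = 0
    for i, j in contacts:
        if (i < breakpoint) == (j < breakpoint):
            intra += 1  # both residues fall on the same side of the split
        else:
            inter += 1  # the contact crosses the split
    return float("inf") if inter == 0 else intra / inter


def best_breakpoint(contacts: List[Tuple[int, int]],
                    candidates: List[int],
                    threshold: float) -> Optional[int]:
    """Return the candidate with the highest ratio, provided that ratio
    meets the threshold; return None if no candidate qualifies."""
    best = None
    best_ratio = threshold
    for b in candidates:
        ratio = contact_ratio(contacts, b)
        if ratio >= best_ratio and (best is None or ratio > best_ratio):
            best, best_ratio = b, ratio
    return best
```

A chain with two compact halves joined by a single contact yields a high ratio at the junction and is split there; a chain with contacts spread evenly yields no qualifying breakpoint and stays a single domain.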
After initially processing the PDB record, structure staff add a number of links and other information that further integrate the MMDB record with other NCBI resources. To begin, the sequence information extracted from the PDB record is entered into the Entrez Protein and/or Nucleotide databases as appropriate, providing a means to retrieve the structure information from sequence searches. As with all sequences in Entrez, precomputed BLAST searches are then performed on these sequences, linking them to other molecules of similar sequence. For proteins, these BLAST neighbors may be different from those determined by VAST; although VAST uses a conservative significance threshold, the structural similarities it detects often represent remote relationships not detectable by sequence comparison. The literature citations in the PDB record are linked to PubMed so that Entrez searches can allow access to the original descriptions of the structure determinations. Finally, semiautomatic processing of the “source” field of the PDB record provides links to the NCBI Taxonomy database. Although these links normally follow the genus and species information given, in some cases this information is either absent in the PDB record or refers only to how a sample was obtained. In these cases, the staff manually enters the appropriate taxonomy links.
The page consists of three parts: the header, the view bar, and the graphic display. The header contains basic identifying information about the record: a description of the protein (Description:), the author list (Deposition:), the species of origin (Taxonomy:), literature references (Reference:), the MMDB-ID (MMDB:), and the PDB code (PDB:). Several of these data serve as links to additional information. For example, the species name links to the Taxonomy browser, the literature references link to PubMed, and the PDB code links to the PDB Web site. The view bar allows the user to view the structure record either as a graphic with Cn3D or as a text record in ASN.1, PDB (RasMol), or Mage format. Records in these text formats can also be downloaded directly from this page. The graphic display contains a variety of information and links to related databases:
(a) The Chain bar. Each chain of the molecule is displayed as a dark bar labeled with residue numbers. To the left of this bar is a Protein hyperlink that takes the user to a view of the protein record in Entrez Protein. The bar itself is also a hyperlink and displays the VAST neighbors of the chain. If a structure contains nucleotide sequences, they are displayed in the order contained in the PDB record. A Nucleotide hyperlink to their left takes the user to the appropriate record in Entrez Nucleotide.
(b) The VAST (3D) Domain bar. The colored bars immediately below the Chain bar indicate the locations of structural domains found by the original MMDB processing of the protein. In many cases, such a domain contains unconnected sections of the protein sequence, and in such cases, discontinuous pieces making up the domain will have bars of the same color. To the left of the Domain bar is a 3D Domains hyperlink (3d Domains) that launches the 3D Domains browser in Entrez, where the user can find information about each constituent domain. Selecting a colored segment displays the VAST Structure Neighbors page for that domain.
(c) The CD bar. Below the VAST Domain bar are rounded, rectangular bars representing conserved domains found by a CD-Search. The bars identify the best scoring hits; overlapping hits are shown only if the mutual overlap with hits having better scores is less than 50%. The CDs hyperlink to the left of the bar displays the CD records in Entrez Domains. Each of the colored bars is also a hyperlink that displays the corresponding CD Summary page configured to show the multiple alignment of the protein sequence with members of the selected CD.
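The 50% overlap rule for the CD bar can be illustrated with a short sketch. Two points are assumptions, since the text does not define them: "mutual overlap" is taken here as the overlapping length divided by the length of the shorter hit, and each hit is compared only with better-scoring hits that are themselves displayed. The function names and the tuple layout are hypothetical.

```python
from typing import List, Tuple

Hit = Tuple[str, int, int, float]  # (name, start, end, score), inclusive residues


def overlap_fraction(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """Overlap length as a fraction of the shorter interval (assumed definition)."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return overlap / shorter


def select_hits(hits: List[Hit], max_overlap: float = 0.5) -> List[Hit]:
    """Walk the hits from best to worst score and keep a hit only if its
    overlap with every hit already kept is below max_overlap."""
    shown: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h[3], reverse=True):
        if all(overlap_fraction(hit[1:3], s[1:3]) < max_overlap for s in shown):
            shown.append(hit)
    return shown
```

With this reading, a weaker hit that sits mostly inside a stronger one is suppressed, while a weaker hit that only brushes the edge of a stronger one is still drawn.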
The top portion of the page contains identifying information about the 3D Domain, along with three functional bars.
(a) The View bar. This bar allows a user to view a selected alignment either as a graphic using Cn3D or as a sequence alignment in HTML, text, or mFASTA format. The user may select which chains to display in the alignment by checking the boxes that appear to the left of each neighbor in the lower portion of the page.
(b) The nr-PDB bar. This bar allows a user to either display all matching records in MMDB or to limit the displayed domains to only representatives of the selected nr-PDB set. The user may also select how the matching domains are sorted in the display and whether the results are shown as graphics or as tabulated data.
(c) The Find bar. This bar allows the user to find specific structural neighbors by entering their PDB or MMDB identifiers.
(d) The lower portion of the page displays a graphical alignment of the various matching domains. The upper three bars show summary information about the query sequence: the top bar shows the maximum extent of alignment found on all the sequences displayed on the current page (users should note that the appearance of this bar, therefore, depends on which hits are displayed); the middle bar represents the query sequence itself that served as input for the VAST search; and the lower bar shows any matching CDs and is identical to the CD bar on the Structure Summary page. Listed below these three summary bars are the hits from the VAST search, sorted according to the selection in the nr-PDB bar. Aligned regions are shown in red, with gaps indicating unaligned regions. To the left of each domain accession is a check box that can be used to select any combination of domains to be displayed either on this page or using Cn3D.
Moreover, each of the bars in the display is itself a link, and placing the mouse pointer over any bar reveals both the extent of the alignment by residue number and the data linked to the bar.
The non-redundant PDB database (nr-PDB) is a collection of four sets of sequence-dissimilar PDB polypeptide chains assembled by NCBI Structure staff. The four sets differ only in their respective levels of non-redundancy. The staff assembles each set by comparing all the chains available from PDB with each other using the BLAST algorithm. The chains are then clustered into groups of similar sequence using a single-linkage clustering procedure. Chains within a sequence-similar group are automatically ranked according to the quality of their structural data. Details of the measures used to determine structure precision and completeness and the methodology of assembling the nr-PDB clusters can be found on the nr-PDB Web page.
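Single-linkage clustering followed by ranking within each cluster can be sketched as below. This is an illustration under stated assumptions, not the nr-PDB pipeline: the list of similar pairs stands in for BLAST comparisons that pass a chosen similarity cutoff, the `quality` scores are a hypothetical stand-in for the structure precision and completeness measures, and the function names are invented for this example.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage_clusters(chains: List[str],
                            similar_pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group chains so that any two chains connected by a path of
    pairwise similarities end up in the same cluster (union-find)."""
    parent = {c: c for c in chains}

    def find(c: str) -> str:
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving keeps trees shallow
            c = parent[c]
        return c

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for c in chains:
        clusters.setdefault(find(c), []).append(c)
    return list(clusters.values())


def rank_within_clusters(clusters: List[List[str]],
                         quality: Dict[str, float]) -> List[List[str]]:
    """Sort each cluster so the chain with the highest quality score comes first."""
    return [sorted(cluster, key=lambda c: quality[c], reverse=True)
            for cluster in clusters]
```

Because linkage is single, two chains can share a cluster without being directly similar to each other, as long as a chain of similar intermediates connects them. Taking the first chain of each ranked cluster then gives one representative per group.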
CDs are recurring units in polypeptide chains (sequence and structure motifs), the extents of which can be determined by comparative analysis. Molecular evolution uses such domains as building blocks and these may be recombined in different arrangements to make different proteins with different functions. The CDD contains sequence alignments that define the features that are conserved within each domain family. Therefore, the CDD serves as a classification resource that groups proteins based on the presence of these predefined domains. CDD entries often name the domain family and describe the role of conserved residues in binding or catalysis. Conserved domains are displayed in MMDB Structure summaries and link to a sequence alignment showing other proteins in which the domain is conserved, which may provide clues on protein function.
The collections of domain alignments in the CDD are imported from two databases outside of the NCBI, Pfam (8) and SMART (9); from the NCBI COG database; from another NCBI collection named LOAD; and from a database curated by the CDD staff. The first task is to identify the underlying sequences in each collection and then link these sequences to the corresponding ones in Entrez. If the CDD staff cannot find the Accession numbers for the sequences in the records from the source databases, they locate appropriate sequences using BLAST. Particular attention is paid to any resulting match that is linked to a structure record in MMDB, and the staff substitute alignment rows with such sequences whenever possible. After importing a collection, the staff choose a sequence that best represents the family. Whenever possible, they choose a representative that has a structure record in MMDB.
Once imported and constructed, each domain alignment in CDD is used to calculate a model sequence, called a consensus sequence, for each CD. The consensus sequence lists the most frequently found residue in each position in the alignment; however, for a sequence position to be included in the consensus sequence, it must be present in at least 50% of the aligned sequences. Aligned columns covered by the consensus sequence are then used to calculate a PSSM, which records the degree to which particular residues are conserved at each position in the sequence. Once calculated, the PSSM is stored with the alignment and becomes part of the CDD. The RPS-BLAST tool locates CDs within a query sequence by searching against this database of PSSMs.
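The consensus rule can be written as a short sketch. It assumes that the alignment is supplied as equal-length strings with "-" marking gaps and that "present" means a non-gap character in that column; the function name is hypothetical, and how CDD breaks ties between equally frequent residues is not stated in the text.

```python
from collections import Counter
from typing import List


def consensus(alignment: List[str], min_occupancy: float = 0.5, gap: str = "-") -> str:
    """Most frequent residue per column, skipping columns in which fewer
    than min_occupancy of the aligned sequences have a residue."""
    n_sequences = len(alignment)
    result = []
    for column in zip(*alignment):
        residues = [r for r in column if r != gap]
        if len(residues) / n_sequences < min_occupancy:
            continue  # column is too sparsely populated to be included
        result.append(Counter(residues).most_common(1)[0][0])
    return "".join(result)
```

For the four-row alignment `["MKV-L", "MRV-L", "MKI-L", "-KVAL"]`, the fourth column is occupied in only one of four sequences and is dropped, so the sketch returns `MKVL`.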
RPS-BLAST (Chapter 16) is a variant of the popular Position-Specific Iterated BLAST (PSI-BLAST) program. PSI-BLAST finds sequences similar to the query and uses the resulting alignments to build a PSSM for the query. With this PSSM, the database is scanned again to draw in more hits and further refine the scoring model. RPS-BLAST, in contrast, uses a query sequence to search a database of precalculated PSSMs and report significant hits in a single pass. The role of the PSSM has changed from “query” to “subject”; hence, the term “reverse” in RPS-BLAST. RPS-BLAST is the search tool used in the CD-Search service.
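The reversal of roles, with the sequence as query and the PSSMs as subjects, can be shown with a toy scan. This is only a sketch of the idea: real RPS-BLAST uses BLAST heuristics, gapped alignment, and statistical significance estimates, none of which appear here. The PSSM layout (one residue-to-score dictionary per position), the default score for unlisted residues, the score cutoff, and the function names are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

Pssm = List[Dict[str, int]]  # one residue-to-score mapping per model position


def best_window_score(query: str, pssm: Pssm,
                      default: int = -1) -> Optional[Tuple[int, int]]:
    """Slide the PSSM along the query without gaps and return
    (best score, start index), or None if the query is too short."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def search_pssm_database(query: str, database: Dict[str, Pssm],
                         min_score: int) -> List[Tuple[str, int, int]]:
    """Score one query against every stored PSSM in a single pass and
    return (name, score, start) for models that reach min_score."""
    hits = []
    for name, pssm in database.items():
        result = best_window_score(query, pssm)
        if result is not None and result[0] >= min_score:
            hits.append((name, result[0], result[1]))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Nothing is rebuilt or refined between passes: the models are computed once, ahead of time, and each query is simply compared against all of them.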
The top of the page serves as a header and reports a variety of identifying information, including the name and description of the CD, other related CDs with links to their summary pages, as well as the source database, status, and creation date of the CD. A taxonomic node link (Taxa:) launches the Taxonomy Browser, whereas a Proteins link (Proteins:) uses CDART to show other proteins that contain the CD. Below the header is the interface for viewing the CD alignment, which can be done either graphically with Cn3D (if the CD contains a sequence with structural data) or in HTML, text, or mFASTA format. It is also possible to view a selected number of the top-listed sequences, sequences from the most diverse members, or sequences most similar to the query. In addition, users may now select sequences with the NCBI Taxonomy Common Tree tool. The lower portion of the page contains the alignment itself. Members with a structural record in MMDB are listed first, and the identifier of each sequence links to the corresponding record.
The upper window displays the structure of the domain with the residues colored according to their sequence conservation, with red indicating high conservation and blue indicating low conservation. The nucleotide bound at site II is shown as an orange space-filling model, and the residues involved in this binding site are yellow. The lower window displays the sequence alignment for the domain with aligned residues shown as colored capital letters. Residues aligned to three of the binding site residues are highlighted in yellow. The sequence for NP_004609 (gi 10835218) occupies the bottom row.
The term “domain” refers in general to a distinct functional and/or structural unit of a protein. Each polypeptide chain in MMDB is analyzed for the presence of two classes of domains, and it is important for users to understand the difference between them. One class, called 3D Domains, is based solely on similar, compact substructures, whereas the second class, called Conserved Domains (CDs), is based solely on conserved sequence motifs. These two classifications often agree, because the compact substructures within a protein often correspond to domains joined by recombination in the evolutionary history of a protein. Note that CD links can be identified even when no 3D structures within a family are known. Moreover, 3D Domain links may also indicate relationships either to structures not included in CDD entries or to structures so distantly related that no significant similarity can be found by sequence comparisons.
For an example query on finding and viewing structures, see Box 2.
The backbone atoms of the aligned residues of the three structures are shown colored according to the sequence conservation of each position in the alignment. Highly conserved positions are colored more red, whereas poorly conserved positions are colored more blue. The bound pyridoxal phosphate ligands are yellow.
To determine the overall shape and size of a protein
To locate a residue of interest in the overall structure
To locate residues in close proximity to a residue of interest
To develop or test chemical hypotheses regarding an enzyme mechanism
To locate or predict possible binding sites of a ligand
To interpret mutation studies
To find areas of positive or negative charge on the protein surface
To locate particularly hydrophobic or hydrophilic regions of a protein
To infer the 3D structure and related properties of a protein with unknown structure from the structure of a homologous protein
To study evolutionary processes at the level of molecular structure
To study the function of a protein
To study the molecular basis of disease and design novel treatments
The first step to any structural analysis at NCBI is to find the structure records for the protein of interest or for proteins similar to it. One may search MMDB directly by entering search terms such as PDB code, protein name, author, or journal in the Entrez Structure Search box on the Structure homepage. Alternative points of entry are shown below.
By using the full array of Entrez search tools, the resulting list of MMDB records can be honed, ideally, to a workable list from which a record can be selected. Users should note that multiple records may exist for a given protein, reflecting different experimental techniques, conditions, and the presence or absence of various ligands or metal ions. Records may also contain different fragments of the full-length molecule. In addition, many structures of mutant proteins are also available. The PDB record for a given structure generally contains some description of the experimental conditions under which the structure was determined, and this file can be accessed by selecting the PDB code link at the top of the Structure Summary page.
Structure Summary pages can also be found from the following NCBI databases and tools:
Select the Structure links to the right of any Entrez record found; records with Structure links can also be located by choosing Structure links from the Display pull-down menu.
Select the Related Sequences link to the right of an Entrez record to find proteins related by sequence similarity and then select Structure links in the Display pull-down menu.
Choose the PDB database from a blastp (protein-protein BLAST) search; only sequences with structure records will be retrieved by BLAST. The Related Structures link provides 3D views in Cn3D.
Select the 3D Structures button on any BLink report to show those BLAST hits for which structural data are available.
From the results of any protein BLAST search, click on a red 'S' linkout to view the sequence alignment with a structure record.
The 3D domains of a protein are displayed on the Structure Summary page. It is useful to know how many 3D domains a protein contains and whether they are continuous in sequence when viewing the full 3D structure of the molecule.
Knowing the secondary structure of a protein can also be a useful prelude to viewing the 3D structure of the molecule. The secondary structure can be viewed easily by first selecting the Protein link to the left of the desired chain in the graphic display. Once in Entrez Protein, selecting Graphics in the Display pull-down menu presents secondary structure diagrams for the molecule.
Cn3D is a software package for displaying 3D structures of proteins. Once it has been installed and the Internet browser has been configured correctly, simply selecting the View 3D Structure button on a Structure Summary page launches the application. Once the structure is loaded, a user can manipulate and annotate it using an array of options as described in the Cn3D Tutorial. By default, Cn3D colors the structure according to the secondary structure elements. However, another useful view is to color the protein by domain (see Style menu options), using the same color scheme as is shown in the graphic display on the Structure Summary page. These color changes also affect the residues displayed in the Sequence/Alignment Viewer, allowing the identification of domain or secondary structure elements in the primary sequence. In addition to Cn3D, users can also display 3D structures with RasMol or Mage. Structures can also be saved locally as an ASN.1, PDB, or Mage file (depending on the choice of structure viewer) for later display.
To determine structurally conserved regions in a protein family
To locate the structural equivalent of a residue of interest in another related protein
To gain insights into the allowable structural variability in a particular protein family
To develop or test chemical hypotheses regarding an enzyme mechanism
To predict possible binding sites of a ligand from the location of a binding site in a related protein
To identify sites where conformational changes are concentrated
To interpret mutation studies
To find areas of conserved positive or negative charge on the protein surface
To locate conserved hydrophobic or hydrophilic regions of a protein
To identify evolutionary relationships across protein families
To identify functionally equivalent proteins with little or no sequence conservation
From any Entrez search, select Related 3D Domains to the right of any record found to view the VAST Structure Neighbors page.
A graphic 2D HTML alignment of VAST neighbors can be viewed as follows:
Alignments of VAST structure neighbors can be viewed as a 3D image using Cn3D.
On the View/Save bar, configure the pull-down menus to the right of the View 3D Structure button.
Select View 3D Structure.
Cn3D automatically launches and displays the aligned structures. Each displayed chain has a unique color; however, the portions of the structures involved in the alignment are shown in red. These same colors are also reflected in the Sequence/Alignment Viewer. Among the many viewing options provided by Cn3D, of particular use is the Show/Hide menu that allows only the aligned residues to be viewed, only the aligned domains, or all residues of each chain.
Following the Domains link for any protein in Entrez, one can find the conserved domains within that protein. The CD-Search (or Protein BLAST, with CD-Search option selected) can be used to find conserved domains (CDs) within a protein. Either the Accession number, gi number, or the FASTA sequence can be used as a query.
Information on the CDs contained within a protein can also be found from these databases and tools:
From any Entrez search: select the Domains link to the right of a displayed record.
From the Structure Summary page of an MMDB record: this page displays the CDs within each protein chain immediately below the 3D Domain bar in the graphic display. Selecting the CDs link shows the CD-Search results page.
From an Entrez Domains search: choose Domains from the Entrez Search pull-down menu and enter a search term to retrieve a list of CDs. Clicking on any resulting CD displays the CD Summary page. To find the location of this CD in an aligned protein, select the CD link following a protein name in the bottom portion of this page.
From the CDD page: locate CDs by entering text terms into the search box and proceed as for an Entrez CD search.
From a BLink report: select the CDD-Search button to display the CD-Search results page.
From the BLAST main page: follow the RPS-BLAST link to load the CD-Search page.
Results from a CD search are displayed as colored bars underneath a sequence ruler. Moving the mouse over these bars reveals the identity of each domain; domains are also listed in a format similar to BLAST summary output (Chapter 16). Pairwise alignments between the matched region of the target protein and the representative sequence of each domain are shown below the bar. Red letters indicate residues identical to those in the representative sequence, whereas blue letters indicate residues with a positive BLOSUM62 score in the BLAST alignment.
These can be displayed by clicking a CD bar within an MMDB Structure Summary page or from a hyperlinked CD name on a CD-Search results page.
If members of a CD have MMDB records, one of these records can be viewed as a 3D image along with the sequence alignment using Cn3D (launched by selecting the pink dot on a CD-Search results page). As in other alignment views, colored capital letters indicate aligned residues, allowing the protein sequence of interest to be mapped onto the available 3D structure.
To locate related functional domains in other protein families
To gain insights into how a given CD is situated within a protein relative to other CDs
To explore functional links between different CDs
To predict the function of a protein whose function is unknown
To establish evolutionary relationships across protein families
Following the Domain Relatives link for any protein in Entrez, one can find other proteins with similar domain architecture. The Conserved Domain Architecture Retrieval Tool (CDART) can take an Accession number or the FASTA sequence as a query to find out the domain architecture of a protein sequence and list other proteins with related domain architectures.
At the top of the CDART results page in a yellow box, the query sequence CDs are represented as “beads on a string”. Each CD has a unique color and shape and is labeled both in the display itself and in a legend located at the bottom of the page. The shapes representing CDs are hyperlinked to the corresponding CD summary page. The proteins matching the query are listed below the yellow box, ranked according to the number of non-redundant hits to the domains in the query sequence. Each match is either a single protein, in which case its Accession number is shown, or a cluster of very similar proteins, in which case the number of members in the cluster is shown. Cluster members can be displayed by selecting the logo to the left of the cluster's diagram. Selecting any protein Accession number displays the flatfile for that protein. To the right of any drawing for a single protein (either on the main results page or after expanding a protein cluster) is a more> link, which displays the CD-Search results page for the selected protein so that the sequence alignment, e.g., of a CDART hit with a CD contained in the original protein of interest, can be examined.
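The ranking described above can be sketched in a few lines. The interpretation is an assumption: "non-redundant hits" is read here as the number of distinct query domains that also occur in a candidate protein, with ties broken alphabetically by protein name. The function name and the dictionary layout are hypothetical, and CDART's actual scoring may differ.

```python
from typing import Dict, List


def rank_by_shared_domains(query_domains: List[str],
                           proteins: Dict[str, List[str]]) -> List[str]:
    """Order proteins by how many distinct query domains they contain,
    dropping proteins that share no domain with the query."""
    query = set(query_domains)
    scored = []
    for name, domains in proteins.items():
        shared = len(query & set(domains))  # repeated domains count once
        if shared > 0:
            scored.append((shared, name))
    return [name for shared, name in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

A protein carrying the same query domain three times scores the same as one carrying it once, which is what distinguishes a non-redundant count from a raw hit count in this reading.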
As illustrated in the sections above, there are numerous connections between the Structure resources and other databases and tools available at the NCBI. What follows is a listing of major tools that support connections.
Because Entrez is an integrated database system (Chapter 15), the links attached to each structure give immediate access to PubMed, Protein, Nucleotide, 3D Domain, CDD, or Taxonomy records.
Although the BLAST service is designed to find matches based solely on sequence, the sequences of Structure records are included in the BLAST databases, and by selecting the PDB search database, BLAST searches only the protein sequences provided by MMDB records. A new Related Structures link provides 3D views for sequences with structure data identified in a BLAST search.
The BLink report represents a precomputed list of similar proteins for many proteins (see, for example, links from Entrez Gene records; Chapter 19). The 3D Structures option on any BLink report shows the BLAST hits that have 3D structure data in MMDB, whereas the CDD-Search button displays the CD-Search results page for the query protein.
A particularly useful interface with the structural databases is provided on the Microbial Genomes page (10). To the left of the list of genomes are several hyperlinks, two of which offer users direct access to structural information. The red [D] link displays a listing of every protein in the genome, each with a link to a BLink page showing the results of a BLAST pdb search for that protein. The [S] link displays a similar protein list for the selected genome, but now with a listing of the conserved domains found in each protein by a CD-Search.
As stated elsewhere, all records in the MMDB are obtained originally from the Protein Data Bank (PDB) (6). Links to the original PDB records are located on the Structure Summary page of each MMDB record. Updates of the MMDB with new PDB records occur once a month.
The CDD staff imports CD collections from both the Pfam and SMART databases. Links to the original records in these databases are located on the appropriate CD Summary page. Both Pfam and SMART are updated several times per year in roughly bimonthly intervals, and the CDD staff update CDD accordingly.
Structures displayed in Cn3D can be exported as a Portable Network Graphics (PNG) file from within Cn3D (the Export PNG command in the File menu). The structure file itself, in the orientation currently being viewed, can also be saved for later launching in Cn3D.
Users can download the NCBI Structure databases from the NCBI FTP site: ftp://ftp.ncbi.nih.gov/mmdb. A Readme file contains descriptions of the contents and information about recent updates. Within the mmdb directory are four subdirectories that contain the following data: