Protein Clusters Help
National Center for Biotechnology Information (NCBI)

Protein Clusters: A Collection of Proteins Grouped by Sequence Similarity and Function

Kathleen O'Neill, Ph.D
NCBI
William Klimke, Ph.D
NCBI
Tatiana Tatusova, Ph.D
NCBI
Created: May 1, 2007.
Last Update: September 24, 2009.

Introduction

The NCBI Entrez Protein Clusters database is a collection of Reference Sequence (RefSeq) proteins from the complete genomes of prokaryotes, plasmids, viruses, and organelles, and from complete and incomplete genomes of protozoa, grouped and annotated based on sequence similarity and protein function (1). Proteins are automatically grouped into clusters based on reciprocal best-hit BLAST scores. The protein clusters database is updated quarterly, and each update includes validation and quality assessment. Because of this processing time, there is a 3-month delay before proteins from new genomes are incorporated into protein clusters.

Clusters in the protein clusters database (protclustdb) are named and functional descriptions are assigned by manual curation. Alignments, information on genome neighborhood, and links to NCBI and external databases are provided for each protein cluster. Specific query and search terms can be found under Querying and Searching.

Proteins encoded by prokaryotes, plasmids, viruses, organelles, and complete and incomplete genomes of protozoa are contained within separate cluster groups. Each cluster in the database has a unique identifying number (UID). Each cluster also has an accession number consisting of a three- or four-letter prefix followed by five digits. Accession numbers generally remain stable between releases unless the curated cluster is removed, split into subclusters, or joined with another curated cluster (these changes are not tracked between releases).

Protein Cluster Database | Cluster Number
Curated Prokaryotic Protein Clusters | PRK#####
Curated Chloroplast Protein Clusters | CHL#####
Uncurated Prokaryote Protein Clusters | CLS#####
Uncurated Chloroplast Protein Clusters | CLSC#####
Curated Mitochondrial Protein Clusters | MTH#####
Uncurated Mitochondrial Protein Clusters | CLSM#####
Curated Virus Protein Clusters | PHA#####
Uncurated Virus Protein Clusters | CLSP#####
Curated Protozoan Protein Clusters | PTZ#####
Uncurated Protozoan Protein Clusters | CLSZ#####
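
As an illustration of the accession format described above, the following minimal Python sketch maps an accession to its cluster group and curation state. The function name classify_accession and the returned dictionary layout are hypothetical; the prefixes come from the table, and the pattern assumes exactly five digits after the prefix, as stated in the text.

```python
import re

# Prefixes taken from the accession table; the (group, curated) layout is illustrative.
PREFIXES = {
    "PRK": ("prokaryotic", True),
    "CLS": ("prokaryotic", False),
    "CHL": ("chloroplast", True),
    "CLSC": ("chloroplast", False),
    "MTH": ("mitochondrial", True),
    "CLSM": ("mitochondrial", False),
    "PHA": ("virus", True),
    "CLSP": ("virus", False),
    "PTZ": ("protozoan", True),
    "CLSZ": ("protozoan", False),
}

# Assumption: a three- or four-letter prefix followed by exactly five digits.
ACCESSION_RE = re.compile(r"^([A-Z]{3,4})(\d{5})$")


def classify_accession(accession):
    """Return group and curation state for a cluster accession, or None if unrecognized."""
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None or match.group(1) not in PREFIXES:
        return None
    group, curated = PREFIXES[match.group(1)]
    return {
        "prefix": match.group(1),
        "number": match.group(2),
        "group": group,
        "curated": curated,
    }


print(classify_accession("PRK09525"))
```

A string that does not match a known prefix, such as XYZ12345, returns None rather than raising an error, which is convenient when filtering mixed lists of identifiers.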

The clusters are divided into curated and non-curated sets. Non-curated clusters are automatically generated and have not yet been manually annotated. Manual curation involves confirming protein domain structure, joining related clusters, addition of publications, and functional annotation. The status of curated clusters is provisional, validated, or reviewed depending upon the level of curation. Provisional status indicates minimal curation, validated status indicates a moderate level of curation, and reviewed status is applied to clusters after more extensive review. Validated and reviewed curated clusters have consistent nomenclature and protein function descriptions and are used in the NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) as well as in cleanup of RefSeq records.

Cluster Creation

RefSeq proteins from all complete prokaryotic genomes and plasmids (or, separately, from organelles, viruses, or protozoa) are compared using all-against-all BLAST. Protein clusters are created using a modified BLAST score that takes into account the length of the hit (alignment) relative to both the query and the subject. The modified score is then used to sort the BLAST results numerically, and all proteins that are contained within the top hits are clustered together. Note: this procedure is likely to be modified in the future.

From the set of raw clusters, curators join and split clusters and add annotation information (protein name, gene name, description) and publication links. The cluster annotation is occasionally propagated to every RefSeq protein that belongs to a specific cluster in order to generate more consistent protein and gene annotation on RefSeq records.

Entrez Protein Cluster Overview

Figure 1. Protein cluster overview. The main page for the protein clusters showing a representative protein cluster (PRK09525 - beta-galactosidase). Certain sections are collapsible for compact viewing, or expandable for a more detailed display (blue arrows). A. Core curated information includes the cluster number, curation status, protein name, and gene name. Additional descriptive curated information is found immediately below the blue bar as well as the enzyme commission number (cross reference section), and curated publications (publication section). B. Cluster info contains basic statistical information about the cluster, the identifier (UID), total proteins, genera, organisms, paralogs, and publications along with the taxonomic conservation level. C. Cluster tools provide analytical tools for the analyses of alignments, construction of phylogenetic trees, and examination of genomic neighborhoods in all sequences (ProtMap) or as conserved patterns. D. Cross references contain important links to other databases at NCBI or external databases. Specific links to other Entrez Databases are below the cross reference section. E. Domain description (CDD), COG category, and BRITE hierarchy (KEGG) provide more general information for a protein cluster. F. Publications are either curated, or collected automatically for proteins (RefSeq, SwissProt, structures, via sequence similarity), domains (CDD), and genes (GeneRIFs) and displayed by category in a semi-collapsed (only one publication per category) state which can be expanded to the full list, or the full list of publications can be displayed in PubMed. G. Top pattern links to the cluster pattern display.

The display (overview) for each cluster provides information on cluster accession, cluster name, and gene name, as well as links to protein display tools, external databases, and publications. Each of these sections can be expanded or collapsed by clicking on the down or right arrows, respectively (Figure 1).

Curation may involve joining or splitting clusters, or simply adding annotation information. Annotation includes the protein name, gene name and synonyms, a brief description, Enzyme Commission number, and curated publication links.

Automatically collected information is available for both curated and noncurated clusters and is displayed in the Cross Reference section and other sections of the overview. Domain information is collected from the NCBI Conserved Domain Database (CDD), and functional classification is collected from the Clusters of Orthologous Groups (COG) database and from the Kyoto Encyclopedia of Genes and Genomes orthologous clusters (KEGG KO). Pathway and hierarchy information are also obtained from KEGG (BRITE hierarchy). Publication links are automatically collected and shown in the Publication Links section.

The information highlighted in blue at the top of the page contains the core curated identifiers and information (Figure 1A). The cluster accession and the curation status are at the left, the name of the cluster/protein in the middle, and the gene name or synonym (if annotated) at the right.

Below the core curated information is a description that contains curated information on the proteins and their function, domain descriptions, COG functional categories, and KEGG BRITE hierarchy showing functional classification (Figure 1E).

Cluster Info

Cluster Info (Figure 1B) displays information specific to the cluster. On the left-hand side below the blue bar are some specific statistics about the protein cluster:

  • unique identifier

  • total proteins in the cluster

  • common taxonomic node

  • total number of genera

  • total number of organisms

  • putative number of paralogs (two or more proteins encoded by genes on the same nucleotide sequence)

  • total number of publications

Cross References

Cross references (Figure 1D) are calculated at the level of each protein (e.g., the domain assignment for a particular protein), collected from all proteins in a given cluster, and then displayed in the cross-reference section. This section provides links to similar clusters and information on protein families, metabolic roles, conserved protein domains, and protein structure. External links to these databases are also provided on the home page for Entrez Protein Clusters.

Related Clusters

Related Clusters (protein clusters related by sequence similarity) are shown in a line below the publication links: from left to right from most to least similar. Similar clusters are calculated from the complete set of all clusters from all organisms and taxonomic groups, including curated and noncurated clusters.

Related clusters are ranked by average BLAST score (average bitscore). The average score is the sum of the scores for all protein pairs divided by the total number of pairs between the full sets of proteins in the two clusters. Related clusters with the highest score are shown on the left. The top ten related clusters (ranked by score) are displayed, if available. Mouseover displays the curated name, and clicking on the link opens the individual cluster. All related clusters are accessible from the link displaying the total number of related clusters. Clusters are color-coded by COG functional categories.
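
The ranking can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the dictionary of pairwise bitscores keyed by protein pair, and the choice to count protein pairs without a BLAST hit as a score of zero are all assumptions made for the example.

```python
def average_bitscore(cluster_a, cluster_b, bitscores):
    """Average bitscore over all protein pairs between two clusters.

    bitscores maps frozenset({protein_1, protein_2}) to a bitscore.
    Assumption: pairs with no recorded hit contribute a score of 0.
    """
    pairs = [(p, q) for p in cluster_a for q in cluster_b]
    if not pairs:
        return 0.0
    total = sum(bitscores.get(frozenset(pair), 0.0) for pair in pairs)
    return total / len(pairs)


def rank_related(query_members, other_clusters, bitscores, top=10):
    """Return accessions of related clusters, most similar first (at most `top`)."""
    scored = []
    for accession, members in other_clusters.items():
        score = average_bitscore(query_members, members, bitscores)
        if score > 0:
            scored.append((score, accession))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [accession for _, accession in scored[:top]]


# Toy data: protein identifiers and scores are invented for illustration.
scores = {
    frozenset(("a1", "b1")): 100.0,
    frozenset(("a2", "b1")): 50.0,
    frozenset(("a1", "c1")): 20.0,
}
others = {"CLS_B": ["b1"], "CLS_C": ["c1"], "CLS_D": ["d1"]}
print(rank_related(["a1", "a2"], others, scores))
```

In this toy data set the cluster labeled CLS_D shares no hits with the query and is therefore left out of the ranking entirely.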

Cluster Patterns

Cluster Patterns are contiguous sets of clusters, curated or noncurated, conserved across multiple nucleotide sequences. The top pattern (containing the maximal protein count – maximum 13 clusters in the pattern for display purposes) is shown (Figure 1G) and links to the Cluster Pattern display page.

Protein Table

Figure 2. Protein table. A protein table from the main display page of a representative protein cluster (PRK09525 – beta-galactosidase). A. Column headings and controls. The headings from left to right are organism, protein name, previous cluster, accession, next cluster, locus_tag, Blink, and alignment. The controls under the organism include collapse (all organism groups will be collapsed), highlight paralogs (two or more proteins encoded by the same nucleotide sequence are considered paralogs) will highlight those with a yellow color (if any are present), and limit to paralogs will limit the entire table to only paralogs (if any). B. List of all proteins in the table grouped by organism group and shown alphabetically. Each row consists of one protein, and alternating rows are shaded for visualization. Each organism group can be collapsed (those with +/- beside them) individually or with the control in the top bar. Organisms can be selected as entire groups, or individually. Checking an organism will result in the selections being highlighted in some of the cluster tools (alignment, tree, ProtMap). Each column is a link to another Entrez database or tool: organism (taxonomy), protein name and accession (protein), and locus_tag (Gene). The diamond in the Blink column is a direct link to pre-computed protein-protein BLAST results (Blink) if any are available. The previous and next cluster links (name available on mouseover) represent the local genomic neighborhood for each nucleotide sequence. Genes in the 5’ or 3’ direction that encode proteins that belong to a cluster are shown color-coded according to COG functional category, with the cluster accession shown. Non-colored clusters indicate no COG functional category, and a blank space indicates a gene that either does not encode a protein belonging to a cluster, or that does not encode a protein at all (an RNA for example). The full genomic neighborhood is available in the cluster tools section (ProtMap). 
The alignment column is a simplistic view of the full alignment, showing aligned residues as gray blocks, and gaps as blank spaces. The conserved domain assignment is shown in the alignment view as colored bars below each sequence. Each colored bar is a different domain and the name is displayed during mouseover. Clicking on the domain will open that particular domain in CDD. The list of domains corresponds to those found in the cross reference section. If all proteins have identical domain assignments, then only the domain assignment on the top protein will be shown. All protein-domain assignments can be displayed with the expand/collapse control on the first protein. Protein sequences that are 100% identical will be boxed in the alignment column. Clicking anywhere in the alignment will open the alignment viewer on the position clicked on.

The protein table (Figure 2) contains information on each of the proteins in the cluster, organized by taxon. Where there are multiple species or strains from the same genus, the table can be collapsed or expanded. Selected sequences or groups at higher taxonomic levels can be highlighted in the alignment or distance tree by checking the box next to the genus or species.

The organism, protein name, accession, locus tag, and length are provided. To provide information on genome neighborhood, genes upstream and downstream of the gene encoding the protein are checked for cluster assignment. If these genes belong to a cluster, they are displayed with their cluster assignment and are color-coded by COG functional category (genes that do not belong to a cluster are not shown). Blink provides precomputed BLAST results. Alignment provides a graphic of the alignment including domain structure. Each of these properties provides a link to the appropriate database or tool.

Cluster Tools

Cluster Tools (Figure 1C) contains several methods of displaying cluster details or more detailed analysis including multiple alignments, phylogenetic trees, and local genomic neighborhood (ProtMap and cluster patterns).

Sequence And Phylogenetic Analysis

Show Detailed Alignment

Figure 3. Detailed alignment. Aligned proteins in a representative cluster (PRK13343 – F0F1 ATP synthase subunit alpha). A. The cluster accession and presentation mode are in the top row, while the second row shows cluster name, number of proteins, number of domains, alignment length, alignment position box, and download button. All domains can be shown on all proteins with the show/hide all button. A specific position in the alignment can be viewed with the set position box which will show that position on the left edge (if less than total position in the alignment) or in the right edge if at the maximum length. The download button is for a FASTA+Gap text file. B. The alignment from A displayed in AA Property mode (coloring scheme explained). Alignment position, consensus sequence, and features are in the top three rows. The feature line displays important features collected by structure group and coincides with the features present in the Conserved Domain database. The example shows a mouseover view of one of the residues in the feature showing the ATP binding site for the F1_ATPase. The alignment below the top three lines shows the list of proteins and their residues. Each protein can be expanded to show corresponding domain assignments. In the example, two proteins show domain assignments, the blue (cd01132 – F1_ATPase_alpha) and the green domain (pfam00306 – ATP-synt_ab_C). C. The alignment from A and B displayed as Consensus mode.

Show detailed alignment displays a scrollable multiple sequence alignment (Figure 3). Selected sequences in the alignment may be highlighted by checking the box next to the species or genus on the protein table (Figure 2B) before displaying the alignment. Amino acids in the alignment display can be shown either as a consensus (only amino acids that differ from the consensus sequence are shown; Figure 3C) or by conservation (conserved amino acids are highlighted by color; Figure 3B). The amino acid coloration scheme is based on the physical and/or chemical properties of the amino acids. Similarities are highlighted if at least 80% of residues in a column are identical or fall into at least one of the following amino acid groups: aromatic (FHWY); aliphatic (ILVA); hydrophobic (ACFILMVWY); alcohol (STC); charged (DEHKR); positive (HKR); negative (DE); polar (CDEHKNQRST); tiny (AGS); small (ACDGNPSTV); or bulky (EFIKLMQRWY). Features are shown in the alignment for critical residues obtained from CDD (via structure). Conserved domains can be displayed by clicking Show/hide all in the Domains box or, for any individual sequence, by clicking the + or − next to the protein accession.

Build Tree

Figure 4. Phylogenetic tree. Tree visualization of a representative cluster (PRK05159 – aspartyl-tRNA synthetase) created using Fast Minimum Evolution and Kimura distances. The tree has been collapsed at the level of genus, and some manipulations have been performed to achieve the image. Each node can be manipulated with a left click and can be collapsed, squeezed, expanded, or the entire tree can be rerooted. Collapsed nodes show the set of proteins associated with each branch, while a squeezed node is compressed for visualization purposes. In the above example one of the Euryarchaeota nodes has been squeezed. Collapsed and squeezed nodes can be expanded back to their full size.

Build tree produces a distance tree based on the multiple alignment (Figure 4). The tree construction method can be chosen by using the drop-down menus. Three distance measures are available for calculating pairwise protein distances: an mPam amino acid substitution matrix, which reflects the evolutionary bias of amino acid substitutions; the Kimura approximation for protein sequences implemented in the Flu database; and a Poisson model of amino acid substitutions, which assumes an equal rate of substitution between amino acids and among lineages (2). Tree construction uses either Fast Minimum Evolution, which estimates the length of any given topology and then selects the topology with the shortest length, or Neighbor-Joining. In Neighbor-Joining, at each step the pair of nodes $i$ and $j$ with the smallest value of $D_{ij} - b_i - b_j$ is chosen, where $D_{ij}$ is the distance between nodes $i$ and $j$, $n$ is the current number of nodes, and $b_i = \sum_{k} D_{ik} / (n - 2)$. The distance between the new node $u$ and each remaining node $k$ is defined as $D_{uk} = (D_{ik} + D_{jk} - D_{ij}) / 2$. Branch lengths are defined as $v_{ui} = (D_{ij} + b_i - b_j) / 2$ and $v_{uj} = (D_{ij} + b_j - b_i) / 2$ (negative lengths are truncated to zero).
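
A single Neighbor-Joining step, written directly from the formulas above, might look like the following Python sketch. The function name and return layout are hypothetical, the input is assumed to be a symmetric distance matrix with at least three nodes, and the example matrix is a commonly used five-taxon textbook case rather than data from the Protein Clusters database.

```python
from itertools import combinations


def neighbor_joining_step(D):
    """Perform one Neighbor-Joining step on a symmetric distance matrix.

    D is a list of lists. Returns the joined pair (i, j), the branch lengths
    (v_ui, v_uj), and the distances from the new node u to every other node k.
    """
    n = len(D)
    if n < 3:
        raise ValueError("need at least three nodes")
    # b_i = sum_k D_ik / (n - 2)
    b = [sum(row) / (n - 2) for row in D]
    # Choose the pair with the smallest D_ij - b_i - b_j.
    i, j = min(
        combinations(range(n), 2),
        key=lambda pair: D[pair[0]][pair[1]] - b[pair[0]] - b[pair[1]],
    )
    # Branch lengths; negative values are truncated to zero.
    v_ui = max(0.0, (D[i][j] + b[i] - b[j]) / 2)
    v_uj = max(0.0, (D[i][j] + b[j] - b[i]) / 2)
    # D_uk = (D_ik + D_jk - D_ij) / 2 for each remaining node k.
    d_u = {k: (D[i][k] + D[j][k] - D[i][j]) / 2 for k in range(n) if k not in (i, j)}
    return (i, j), (v_ui, v_uj), d_u


EXAMPLE = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, branches, new_distances = neighbor_joining_step(EXAMPLE)
print(pair, branches, new_distances)
```

A full tree is obtained by repeating this step on the reduced matrix (the two joined nodes replaced by the new node) until only two nodes remain.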

Genomic Neighborhood

Two tools allow an analysis of the genomic neighborhood: ProtMap and cluster patterns (Figure 1).

Genome ProtMap

Figure 5. ProtMap. A representative protein cluster is displayed with the ProtMap tool (PRK05159 – aspartyl-tRNA synthetase - the same cluster shown in Figure 2) showing local genomic neighborhood centered on a set of related genes (via the proteins encoded by them). A. Description and search box. ProtMap currently can be used to visualize genomic neighborhoods of COGs, VOGs, and protein clusters. Below the search box is the cluster accession (with links to protein cluster overview), name, and number of nucleotide sequences in the ProtMap display. Show legends displays color-coded COG functional categories and show cluster colors shows the full list of clusters from the ProtMap by accession and name (color-coded as in ProtMap). B. A representative slice of the ProtMap for the Methanococcaceae (showing the region corresponding to the checked taxa from Figure 2 - highlighted in yellow). The organism group (link to taxonomy) and RefSeq Accession Numbers (link to nucleotide) and length are in the leftmost column while the next column contains the organism name. The actual protein map in the rightmost column shows local genomic neighborhood centered around each protein family (Protein Cluster, VOG, COG – yellow boxes with red highlight show the protein family) with a 10 kb window centered around each protein family. All proteins in the surrounding area are color-coded by COG category (if applicable) or gray (proteins that do not belong to a COG). Genes that do not encode proteins are not displayed in the ProtMap tool. The protein name, locus_tag, protein cluster, and location are shown for all proteins on mouseover, and clicking on each protein will display a menu with options for protein, gene, or ProtMap links (if applicable). C. ProtMap display for the COG corresponding to PRK05159 (COG0017J - aspartyl/asparaginyl-tRNA synthetases). The organism list is slightly different (those belong to the COG, not the protein cluster) but the map display is similar. 
In addition, segment links below the nucleotide sequence are links to gMap (genomic map) which can be used to delimit the ProtMap with sequences that only contain that segment, or where the segment can be viewed in gMap. D. gMap view of the blue segment from C (labeled #60). gMap contains precomputed genomic similarity using BLAST to detect syntenic blocks across multiple species. A segment (blue arrow labeled #2) is found in the Methanosarcinaceae that corresponds to the nucleotide sequence for the aspC gene (which encodes the proteins found in COG COG0017J and in protein cluster PRK05159). gMap links to proteins, and segments can be zoomed in or out (greater or lesser resolution).

Genome ProtMap (Figure 5) displays a 10-kb region surrounding (in both directions) each of the proteins in the cluster (Genome ProtMap by cluster accession) or each of the proteins that belong to the same Cluster of Orthologous Groups, or COG (Genome ProtMap by COG#####). Although the ProtMap may appear to be an alignment, it is not. Instead, all of the genes encoding the proteins that are members of the cluster are drawn in the same 5’ – 3’ direction in the center of the display.

Cluster Patterns

Figure 6. Cluster pattern. A representative cluster pattern for PRK13024 (bifunctional SecD/SecF). A. The top bar shows the cluster number, curated name, number of proteins, and a download option for a tab-delimited table. The same two links that are available in ProtMap, ‘show legend’ and ‘show cluster colors’, display the colored COG functional categories and a list of all clusters from all patterns in the current view, respectively. B. Cluster pattern info showing pattern number, from most to least conserved, the number of proteins contributing to each pattern, the number of clusters in each pattern, and the common taxonomic node for each pattern. Links to Entrez databases (proteins, protein clusters, taxonomy) and ProtMap display for each pattern are available. There is an option to view a single pattern (row) in ProtMap. C. Each cluster pattern is shown in a row, with clusters color-coded according to COG functional category. Clusters encoded by genes on the complementary strand are shown with an arrow from right to left. Clusters are semi-aligned: clusters with the same accessions are aligned in columns, and gaps (which appear as dark gray boxes) are inserted to align clusters. The cluster highlighted in orange in a single column is the one from which the patterns are derived (similar to ProtMap - in this case PRK13024 – bifunctional preprotein translocase subunit SecD/SecF). The cluster name is shown upon mouseover on each cluster in the pattern and links to Entrez Protein Clusters, and ProtMap and pattern display tools are available when clicking on each cluster number (boxed “queuine tRNA-ribosyltransferase”).

Cluster Patterns (Figure 6) are contiguous sets of clusters, curated or noncurated, conserved across multiple nucleotide sequences. The Cluster Pattern display shows a taxonomically compressed version of that shown in ProtMap but with a potentially broader set of clusters along a set of nucleotide sequences. The patterns are derived in the following manner.

  • Each pattern must have at least 3 clusters

  • Each pattern must have more than 2 proteins

  • Each protein belongs only to a single pattern

Example:

If six genomes contain genes that encode proteins belonging to clusters in the following manner (A, B, and C are taxonomic nodes, genes are genes along the chromosome, and CL# are cluster numbers associated with each protein encoded by the genes):

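Sequence | gene1 | gene2 | gene3 | gene4 | gene5 | gene6 | gene7
A1 | CL1 | CL4 | CL5 | CL6 | CL8 | CL11 | CL14
A2 | CL2 | CL4 | CL5 | CL6 | CL9 | CL12 | CL15
A3 | CL3 | CL4 | CL5 | CL6 | CL10 | CL13 | CL16
B1 | CL7 | CL5 | CL4
B2 | CL7 | CL5 | CL4
B3 | CL7 | CL5 | CL4
C1 | CL5 | CL17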

Then the resulting patterns will be:

proteins | clusters | taxonomy |
3 | 3 | A | CL4 CL5 CL6
3 | 3 | B | CL7 CL5 CL4
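
The worked example can be reproduced with a deliberately simplified Python sketch. It only counts contiguous runs of exactly three clusters that occur in at least three sequences; it does not implement the full procedure (maximizing protein count, the window limit, or assigning each protein to a single pattern). The function name and the fixed run length are assumptions chosen to match the example.

```python
from collections import defaultdict

# Cluster assignments along each sequence, copied from the example table.
EXAMPLE_SEQUENCES = {
    "A1": ["CL1", "CL4", "CL5", "CL6", "CL8", "CL11", "CL14"],
    "A2": ["CL2", "CL4", "CL5", "CL6", "CL9", "CL12", "CL15"],
    "A3": ["CL3", "CL4", "CL5", "CL6", "CL10", "CL13", "CL16"],
    "B1": ["CL7", "CL5", "CL4"],
    "B2": ["CL7", "CL5", "CL4"],
    "B3": ["CL7", "CL5", "CL4"],
    "C1": ["CL5", "CL17"],
}


def conserved_runs(sequences, run_length=3, min_sequences=3):
    """Find contiguous cluster runs of a fixed length shared by enough sequences.

    Simplification: only runs of exactly `run_length` clusters are considered.
    Returns a dict mapping each conserved run to the sequences that contain it.
    """
    hits = defaultdict(list)
    for name, clusters in sequences.items():
        seen = set()
        for start in range(len(clusters) - run_length + 1):
            run = tuple(clusters[start:start + run_length])
            if run not in seen:  # count each sequence once per run
                seen.add(run)
                hits[run].append(name)
    return {run: names for run, names in hits.items() if len(names) >= min_sequences}


for run, names in conserved_runs(EXAMPLE_SEQUENCES).items():
    print(len(names), len(run), " ".join(run), names)
```

Sequence C1 contributes nothing because it carries only two clusters, which is consistent with its absence from the resulting patterns in the example.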

Patterns are collected in a window of at most 80,000 bp centered on the gene encoding the protein that is a member of the cluster. Each pattern is chosen to contain the maximum number of proteins (i.e., maximum conservation rather than maximum length along the genome, even if the final pattern is shorter as a result). Currently, this procedure may produce non-optimal solutions, and some clusters may not have sufficient conservation to generate any patterns at all.

The top pattern (containing the maximal protein count – maximum 13 clusters in the pattern for display purposes) is shown above the protein table on the overview page (Figure 1G). All of the patterns from a given cluster are displayed on the Cluster Pattern tool display.

Querying and Searching

There are numerous ways to query protein clusters, either with search terms in Entrez or with a protein or nucleotide sequence using either CD-Search or Concise Protein BLAST.

Limits

Figure 7. Protein clusters limits page. For searching the protein clusters showing the options to limit a search. A. Drop down menu for specific field to limit as described in the table. B. A series of checkboxes can be used to limit searches to status type, source, or by taxonomy.

The Limits tab on the search bar allows search limits to be set from a drop-down menu (Figure 7).

After selecting a limit, the selected field will show up in the yellow bar behind the Field tag. Searches can also be refined by checking the desired box(es) in the table to limit by curation status, source, or taxonomy. The Limits checkbox will also be marked and will remain through subsequent searches. To remove the limits for a particular search, deselect the checkbox.

The following table summarizes the various limits and properties that can be used to refine searches.

Field name | Definition [including field abbreviations] | Examples
Accession | Unique identifier for each cluster. [ACCN] [ACCESSION] | Retrieve cluster with the accession PRK09525: PRK09525[ACCN]
Average Length | Average length of proteins in the cluster. [Average Length] | Retrieve all clusters with an average protein length of 100–300 amino acids: 100:300[Average Length]
COG | COG (Clusters of Orthologous Groups) is a phylogenetic classification of proteins from complete genomes. [COG] | Retrieve all clusters with COG3250: COG3250[COG]
Creation date | Date the record was created. Note the format is: YEAR/MONTH/DAY including the forward slashes. [Creation Date] | Find all clusters created in 2007: 2007[Creation Date]
Domain Name | Domains are structural or functional units in a protein; nomenclature is based on the NCBI Conserved Domain Database. [Domain Name] | Retrieve all clusters with an ATP-binding domain: ABC[Domain Name]
Domains | Number of domains in the protein cluster. [Domains] | Retrieve all clusters with 3 domains: 3[Domains]
EC/RN Number | The number assigned by the Enzyme Commission or Chemical Abstracts Service (CAS) to designate a particular enzyme or chemical, respectively. [EC/RN Number] | Retrieve all clusters containing the EC number 3.2.1.23: 3.2.1.23[EC/RN Number]
Filter | Retrieves clusters with a pre-selected property. [Filter] | Retrieve all curated clusters: Curated[Filter]
Gene Name | Abbreviated name for the gene. [Gene Name] | Retrieve all lacZ clusters: lacZ[Gene Name]
Gene synonym | Alternative name for gene found in the database records. [Gene Synonym] | Retrieve all clusters with cbiJ as a gene synonym: cbiJ[Gene Synonym]
HAMAP | Number assigned to designate a well-defined and well-conserved protein family or subfamily by the Swiss Institute for Bioinformatics. HAMAP stands for High-quality Automated and Manual Annotation of microbial Proteomes. [HAMAP] | Retrieve all clusters with the HAMAP MF_00008: MF_00008[HAMAP]
KO | Number assigned to designate a manually curated set of orthologous gene groups in the complete genomes by the Kyoto Encyclopedia of Genes and Genomes. [KO] | Retrieve all clusters with a KO of k01190: k01190[KO]
Locus Tag | Locus tags are identifiers that are systematically applied to every gene in a genome. [Locus Tag] | Retrieve all clusters with a protein with the locus tag of Z0440: Z0440[Locus Tag]
Organism | The scientific and common names for the organisms associated with the protein sequence. [Organism] | Find all clusters associated with Escherichia coli: Escherichia coli[Organism]
Paralogs | Number of paralog proteins in a cluster. [Paralogs] | Retrieve all clusters with 13 paralogs: 13[Paralogs]
Properties | An attribute of the cluster based on DNA source or curation status. [PROP] [Properties] | Retrieve all clusters from chloroplasts: source chloroplast[Properties]
Protein Accession | The unique accession number of the protein. [Protein Accession] | Retrieve all clusters containing the protein accession NP_414878: NP_414878[Protein Accession]
Protein GI number | A series of digits that are assigned consecutively by NCBI to each sequence it processes. [Protein GI] | Retrieve all clusters containing the protein GI of 16128329: 16128329[Protein GI]
Protein Name | The standard name of proteins found in database records. Common names may not be indexed in this field so it is best to also consider All Fields or Text Words. [PROT] [Protein Name] | Retrieve all clusters containing the protein beta galactosidase: beta galactosidase[Protein Name]
PubMed ID | Unique identifier for the publication in the PubMed database. [PubMed ID] | Retrieve all clusters with the PubMed ID number 97298: 97298[PubMed ID]
Sequence Length | Exact number of amino acids in the protein sequence. [Sequence Length] | Retrieve all clusters with at least one protein with a length of 1024 amino acids: 1024[Sequence Length]
Size | Number of proteins in the cluster. [Size] | Retrieve all clusters with 25 proteins: 25[Size]
Taxonomy ID | Identifier for the species or strain in the NCBI taxonomy database. [Taxonomy ID] | Retrieve all clusters with proteins from taxonomy ID 83332: 83332[Taxonomy ID]
Title | Title of the protein cluster. [Title] | Retrieve all beta-D-galactosidase protein clusters: beta-D-galactosidase[Title]
Total Publications | Total number of publications associated with proteins in the cluster. [Total Publications] | Retrieve all clusters with 51 publications: 51[Total Publications]
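
The term syntax used in the table (a value followed by a field name in square brackets, with a colon for ranges) is simple enough to generate programmatically. The helper names below are hypothetical and serve only to illustrate the pattern; the optional quoting mirrors the quoted organism name used in the combined query later in this section.

```python
def field_term(value, field, quote=False):
    """Format a fielded Entrez search term, e.g. lacZ[Gene Name]."""
    text = f'"{value}"' if quote else str(value)
    return f"{text}[{field}]"


def range_term(low, high, field):
    """Format a range search term, e.g. 100:300[Average Length]."""
    return f"{low}:{high}[{field}]"


print(field_term("PRK09525", "ACCN"))
print(range_term(100, 300, "Average Length"))
print(field_term("Escherichia coli", "Organism", quote=True))
```

Generating terms this way helps keep field names spelled consistently when many queries are built from a list of values.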

Preview/Index

The Preview/Index page on any Entrez database is used to construct queries and to view terms that have been indexed under any field name. The table in the previous section described the fields used in indexing the records and provided some representative queries using those fields. Information on using Preview/Index can be found in the Entrez help documentation.

The History, Clipboard, and Details features are consistent with other Entrez databases. You may find additional information in the Entrez help documentation.

If you have any additional questions, then please send an email to: info@ncbi.nlm.nih.gov

Constructing Powerful Queries

Combining search terms with limits and filters allows one to generate powerful queries. For example:

"Escherichia coli"[Organism] AND curated[filter] AND translation[COG Group]

This search finds all curated clusters that belong to the translation-related COG functional group (COG Group J) and contain RefSeq proteins encoded by Escherichia coli genomes.
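Queries of this shape can also be assembled programmatically. The following is a minimal Python sketch, not part of any NCBI tool: the helper names `tag` and `all_of` are hypothetical, and quoting multi-word terms as phrases is an assumption based on the example above. It only builds the query string; it does not submit it anywhere.

```python
def tag(term: str, field: str) -> str:
    """Return an Entrez-style fielded term, e.g. 13[Paralogs]."""
    term = term.strip()
    # Assumption: multi-word terms are quoted so they are treated as a phrase,
    # as in the "Escherichia coli"[Organism] example.
    if " " in term and not term.startswith('"'):
        term = f'"{term}"'
    return f"{term}[{field}]"


def all_of(*terms: str) -> str:
    """Join fielded terms with the Boolean AND operator."""
    return " AND ".join(terms)


query = all_of(
    tag("Escherichia coli", "Organism"),
    tag("curated", "filter"),
    tag("translation", "COG Group"),
)
print(query)
```

The resulting string can be pasted into the search box; the field names are the ones listed in the table of search fields.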

Query with Sequence

Figure 8. Sequence similarity search against protein clusters. A representative chloroplast protein (YP_319743 ribosomal protein S12) was used to search protein clusters by sequence similarity. A. Curated prokaryotic protein clusters are available as one of the databases in the CD-Search tool. Only protein sequences can be used in searches via CD-Search. B. Full results of CD-Search against PRK showing 3 protein clusters: PRK05163 (bacterial ribosomal protein S12), CHL00051 (chloroplast ribosomal protein S12, which contains YP_319743), and PRK04211 (archaeal ribosomal protein S12P). The top section shows a cartoon view of the alignment. Note the partial alignment against PRK04211. The bottom view has an expanded display of description and alignment information. C. Concise Protein BLAST results. As described in the text, a representative protein is chosen from each subcluster at the level of genera and used in this BLAST database in order to speed up results and reduce redundancy. Proteins that are not members of a cluster are also added to this database. Score and E-value are available for the representative proteins and for the nonclustered proteins. The table shows the organism (taxonomy), protein name and accession (protein), locus_tag (gene), and cluster (protein clusters) names and links. Blink (pre-computed BLAST results) and BL2Seq (query and subject) results are available for further analysis. Concise Protein BLAST can use both protein and nucleotide input sequences, and the results can be reformatted to standard BLAST output.

Currently, there are two ways to query protein clusters by sequence: CD-Search and Concise Protein BLAST. The two methods draw on different resources and serve different purposes (Figure 8). Note: CD-Search can only be queried using a protein sequence, whereas both protein and nucleotide sequences can be used in Concise Protein BLAST.

CD-Search

Position-specific scoring matrices (PSSMs) are constructed from the alignments of each of the curated clusters. The PSSMs for curated prokaryotic clusters have been added to the CD-Search page and can be searched using RPS-BLAST (Figure 8; select PRK from the database pull-down menu). More information on how to use CD-Search can be found on the CDD help page. The CD-Search page accepts only protein sequence queries. Updates to CD-Search may also lag slightly behind those in Entrez Protein Clusters.
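To illustrate what a PSSM encodes, the sketch below builds a toy log-odds matrix from a small ungapped alignment and scores a peptide against it. This is a simplified illustration, not the procedure used to build the PRK matrices: the uniform background frequency, the flat pseudocount, the function names, and the four-residue alignment are all assumptions made for the example, and production tools use more elaborate weighting and background models.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Count only standard residues; gap characters are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, peptide):
    """Sum the per-position scores of an ungapped peptide of PSSM length."""
    if len(peptide) != len(pssm):
        raise ValueError("peptide length must equal the number of columns")
    return sum(column[aa] for column, aa in zip(pssm, peptide))


# Hypothetical toy alignment, for illustration only.
toy_alignment = ["MKTA", "MKSA", "MRTA", "MKTG"]
pssm = build_pssm(toy_alignment)
print(score(pssm, "MKTA") > score(pssm, "WWWW"))
```

A peptide that matches the conserved columns scores higher than one that does not, which is the property a profile search relies on.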

Concise Protein BLAST

The Concise Protein BLAST database consists of ALL prokaryotic protein clusters (curated and non-curated), as well as nonclustered proteins (Figure 8). However, the protein clusters are not in their raw form. Instead, each cluster has been sliced at the level of genera to provide “subclusters”. From each subcluster, a single protein representative has been chosen (randomly) and is used in the BLAST database. A single curated cluster, therefore, may comprise many subclusters, each with a representative. This reduces the level of redundancy when using BLAST, resulting in speedier searches and providing a broader taxonomic view than is typically found in BLAST results. The other proteins in the cluster are automatically linked to this representative and will also be found in the search results, although without the BLAST score and E-value because they are not specifically examined. All proteins that do not belong to the genus-level clusters are also added to the database for completeness.
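The slicing described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not NCBI's implementation: the genus is taken to be the first word of the organism name, the accessions are made-up placeholders, and the function name `pick_representatives` is hypothetical.

```python
import random
from collections import defaultdict


def pick_representatives(cluster_members, seed=None):
    """Slice one cluster by genus and pick a random representative per slice.

    cluster_members is a list of (accession, organism_name) tuples.
    Returns {genus: {"representative": accession, "linked": [other accessions]}}.
    """
    rng = random.Random(seed)
    by_genus = defaultdict(list)
    for accession, organism in cluster_members:
        genus = organism.split()[0]  # assumption: genus is the first word
        by_genus[genus].append(accession)

    subclusters = {}
    for genus, accessions in by_genus.items():
        representative = rng.choice(accessions)
        subclusters[genus] = {
            "representative": representative,
            # These proteins are linked to the representative but are not
            # searched themselves, so they carry no score or E-value.
            "linked": [acc for acc in accessions if acc != representative],
        }
    return subclusters


# Placeholder accessions, for illustration only.
members = [
    ("PROT_A1", "Escherichia coli"),
    ("PROT_A2", "Escherichia fergusonii"),
    ("PROT_B1", "Salmonella enterica"),
    ("PROT_C1", "Bacillus subtilis"),
]
subclusters = pick_representatives(members, seed=0)
for genus, entry in subclusters.items():
    print(genus, entry["representative"], entry["linked"])
```

Only the representatives would go into the BLAST database, so a cluster spanning three genera contributes three sequences regardless of how many proteins it holds.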

Query Page

Queries can be either protein or nucleotide sequences, which are searched with the blastp and blastx programs, respectively. Accessions, GIs, or sequences in FASTA format can be entered in the query box.

Default parameters are set below the query box. The expect (E-value) threshold is set low, which will help reduce the number of BLAST results. Information about each parameter is available by clicking on each name.

Results Page

The results page is not the one typically returned for BLAST results, although a link is provided to view the results in standard format (Figure 8).

The query is shown, along with the length, the number of hits for total proteins, and the proteins represented by the genus-level clusters. A link to each cluster (if it exists) for either curated or non-curated clusters is also provided. Those proteins that do not have a cluster link are singletons that do not exist in clusters (and would not be found in the Entrez Protein Clusters site).

Results are returned in a collapsed table format. Genus-level clusters are marked with a plus (+) sign at each level and can be expanded. The table is sortable by organism name and by BLAST score. The other proteins in a cluster have neither a BLAST score nor an E-value, because only the one representative chosen from each genus-level cluster is searched when a query is submitted.

BLAST Help

Help and information on BLAST are available from the main BLAST page.

Microbial genomes can also be searched using BLAST.

Release And Data Retrieval

Statistics

Summed statistics for the current protein clusters release and date are available on this webpage:

http://www.ncbi.nlm.nih.gov/genomes/prkstats.html

Note that some of the links in the table are dynamic searches to Entrez databases that may generate different totals than those shown when the search is performed.

FTP Files

After every public protein clusters release (approximately every 3 months), all data derived from the clustering procedure, automated analyses, added information, and curation are released simultaneously in Entrez and on the FTP site at this location:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/CLUSTERS/

Cluster releases are currently by month and year, and subdivided by organism group (PRK, CHL, MTH, etc.). Current information about the directory, file structure, and statistics is in the README document:

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/CLUSTERS/README
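The organism-group codes also appear as the prefix of each cluster accession (for example PRK05163 and CHL00051 in Figure 8). The Python sketch below splits an accession into its group code and number so that a list of accessions can be sorted by group. The pattern is an assumption based on the description in the Introduction (a three- or four-letter code followed by five digits), and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: three or four letters followed by exactly five digits.
ACCESSION_RE = re.compile(r"([A-Z]{3,4})(\d{5})")


def split_accession(accession):
    """Return (group_code, number) for a cluster accession such as PRK05163."""
    match = ACCESSION_RE.fullmatch(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognized cluster accession: {accession!r}")
    return match.group(1), match.group(2)


def group_by_code(accessions):
    """Map each organism-group code to the accessions that carry it."""
    groups = defaultdict(list)
    for accession in accessions:
        code, _ = split_accession(accession)
        groups[code].append(accession)
    return dict(groups)


grouped = group_by_code(["PRK05163", "CHL00051", "PRK04211"])
print(grouped)
```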

The files include flatfiles, alignments, and PSSMs for curated clusters, and link lists for protein GI to cluster, taxonomy, cluster ID to PubMed ID, etc. The Concise Protein BLAST database is also available. Large files are stored as tarballs (*.tgz files).
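Once a tarball has been downloaded, its contents can be inspected without unpacking everything. The following Python sketch uses the standard `tarfile` module. Because the actual file names inside a release are documented only in the README, the demonstration builds a small throwaway archive with a made-up member name; the helper names are likewise hypothetical.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_members(archive_path):
    """Return the sorted names of the regular files inside a .tgz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def read_member(archive_path, member_name):
    """Return the text of a single file in the archive without extracting it."""
    with tarfile.open(archive_path, "r:gz") as archive:
        handle = archive.extractfile(member_name)
        if handle is None:
            raise ValueError(f"{member_name!r} is not a regular file")
        return handle.read().decode("utf-8")


# Demonstration on a throwaway archive; the member name is made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_archive = Path(tmp) / "demo.tgz"
    payload = b"example line\n"
    with tarfile.open(demo_archive, "w:gz") as archive:
        info = tarfile.TarInfo(name="demo/example.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    demo_names = list_members(demo_archive)
    demo_text = read_member(demo_archive, "demo/example.txt")

print(demo_names)
```

Reading members in place avoids writing the whole archive to disk when only one file, such as a single link list, is needed.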

References

1. Klimke W. et al. The National Center for Biotechnology Information's Protein Clusters Database. Nucleic Acids Res. 2009; 37(Database issue): D216-23. [PubMed]
2. Bao Y. et al. The influenza virus resource at the National Center for Biotechnology Information. J Virol. 2008; 82(2): 596-601. [PubMed]
