Introduction
The NCBI Entrez Protein Clusters database is a collection of Reference Sequence (RefSeq) proteins from the complete genomes of prokaryotes, plasmids, viruses, and organelles, and from the complete and incomplete genomes of protozoa, grouped and annotated based on sequence similarity and protein function (1). Proteins are automatically grouped into clusters based on reciprocal best-hit BLAST scores. The protein clusters database is updated quarterly, and each update includes validation and quality assessment. Because of this processing time, there is a 3-month delay before proteins from new genomes are incorporated into protein clusters.
Clusters in the protein clusters database (protclustdb) are named and functional descriptions are assigned by manual curation. Alignments, information on genome neighborhood, and links to NCBI and external databases are provided for each protein cluster. Specific query and search terms can be found under Querying and Searching.
Proteins encoded by prokaryotes, plasmids, viruses, organelles, and complete and incomplete genomes of protozoa are contained within separate cluster groups. Each cluster in the database has a unique identifying number (UID). Each cluster also has an accession number consisting of a three- or four-letter code followed by five digits. Accession numbers are generally stable between releases unless the curated cluster is removed, split into subclusters, or joined with another curated cluster (these changes are not tracked between releases).
| Protein Cluster Database | Cluster Number |
|---|---|
| Curated Prokaryotic Protein Clusters | PRK##### |
| Curated Chloroplast Protein Clusters | CHL##### |
| Uncurated Prokaryote Protein Clusters | CLS##### |
| Uncurated Chloroplast Protein Clusters | CLSC##### |
| Curated Mitochondrial Protein Clusters | MTH##### |
| Uncurated Mitochondrial Protein Clusters | CLSM##### |
| Curated Virus Protein Clusters | PHA##### |
| Uncurated Virus Protein Clusters | CLSP##### |
| Curated Protozoan Protein Clusters | PTZ##### |
| Uncurated Protozoan Protein Clusters | CLSZ##### |
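The accession format in the table above can be checked programmatically. The following is a minimal sketch, assuming the prefixes listed in the table are the complete set; the function name `classify_accession` and the group labels are illustrative and not part of any NCBI tool.

```python
import re

# Prefix -> (organism group, curated?) copied from the table above.
PREFIXES = {
    "PRK": ("prokaryote", True),
    "CHL": ("chloroplast", True),
    "MTH": ("mitochondrion", True),
    "PHA": ("virus", True),
    "PTZ": ("protozoa", True),
    "CLS": ("prokaryote", False),
    "CLSC": ("chloroplast", False),
    "CLSM": ("mitochondrion", False),
    "CLSP": ("virus", False),
    "CLSZ": ("protozoa", False),
}

# Three or four letters followed by exactly five digits.
ACCESSION_RE = re.compile(r"^([A-Z]{3,4})(\d{5})$")


def classify_accession(accession):
    """Return (group, is_curated, number), or None if the format is not recognized."""
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    prefix, digits = match.groups()
    if prefix not in PREFIXES:
        return None
    group, curated = PREFIXES[prefix]
    return group, curated, int(digits)
```

For example, `classify_accession("PRK09525")` reports a curated prokaryotic cluster, while a string that does not match the letter-plus-five-digit pattern yields `None`.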
The clusters are divided into curated and non-curated sets. Non-curated clusters are automatically generated and have not yet been manually annotated. Manual curation involves confirming protein domain structure, joining related clusters, adding publications, and annotating function. The status of curated clusters is provisional, validated, or reviewed depending upon the level of curation. Provisional status indicates minimal curation, validated status indicates a moderate level of curation, and reviewed status is applied to clusters after more extensive review. Validated and reviewed curated clusters have consistent nomenclature and protein function descriptions and are used in the NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) as well as in cleanup of RefSeq records.
Cluster Creation
RefSeq proteins from all complete prokaryotic genomes and plasmids (or, separately, from organelles, viruses, or protozoa) are compared using all-against-all BLAST. Protein clusters are created using a modified BLAST score that takes into account the length of the hit (alignment) relative to both the query and the subject. The modified score is then used to numerically sort the BLAST results, and all proteins that are contained within the top hits are clustered together. Note: this procedure is likely to be modified in the future.
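The exact scoring formula and grouping rule are not given here, so the following is only an illustrative sketch of the general idea: a length-aware score followed by grouping on reciprocal best hits between genomes. The weighting in `adjusted_score` (raw score scaled by the smaller coverage fraction), the restriction to hits between different genomes, and all function names are assumptions made for this example, not the procedure used to build the database.

```python
from collections import defaultdict


def adjusted_score(bit_score, aln_len, query_len, subject_len):
    # Hypothetical weighting: scale the raw score by the smaller of the
    # query coverage and the subject coverage.
    coverage = min(aln_len / query_len, aln_len / subject_len)
    return bit_score * coverage


def reciprocal_best_hit_clusters(hits, lengths, genome_of):
    """Group proteins linked by reciprocal best hits between genomes.

    hits:      iterable of (query, subject, bit_score, aln_len)
    lengths:   dict protein -> sequence length
    genome_of: dict protein -> genome identifier
    """
    best = {}  # (query, subject genome) -> (adjusted score, subject)
    for query, subject, bit_score, aln_len in hits:
        if genome_of[query] == genome_of[subject]:
            continue  # assumption: this sketch ignores within-genome hits
        score = adjusted_score(bit_score, aln_len, lengths[query], lengths[subject])
        key = (query, genome_of[subject])
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)

    # Union-find: merge every pair of proteins that are each other's best hit.
    parent = {protein: protein for protein in lengths}

    def find(protein):
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]
            protein = parent[protein]
        return protein

    for (query, _), (_, subject) in best.items():
        back = best.get((subject, genome_of[query]))
        if back is not None and back[1] == query:
            parent[find(query)] = find(subject)

    clusters = defaultdict(list)
    for protein in lengths:
        clusters[find(protein)].append(protein)
    return sorted(sorted(members) for members in clusters.values())
```

A protein whose best hit is not reciprocated stays in its own single-member group in this sketch.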
From the set of raw clusters, curators join and split clusters, and add annotation information (protein name, gene name, description), and publication links. The cluster annotation is occasionally propagated to every RefSeq protein that belongs to a specific cluster in order to generate more consistent protein and gene annotation on RefSeq records.
Entrez Protein Cluster Overview
Figure 1. Protein cluster overview. The main page for the protein clusters showing a representative protein cluster (PRK09525 - beta-galactosidase). Certain sections are collapsible for compact viewing, or expandable for a more detailed display (blue arrows). A. Core curated information includes the cluster number, curation status, protein name, and gene name. Additional descriptive curated information is found immediately below the blue bar as well as the enzyme commission number (cross reference section), and curated publications (publication section). B. Cluster info contains basic statistical information about the cluster, the identifier (UID), total proteins, genera, organisms, paralogs, and publications along with the taxonomic conservation level. C. Cluster tools provide analytical tools for the analyses of alignments, construction of phylogenetic trees, and examination of genomic neighborhoods in all sequences (ProtMap) or as conserved patterns. D. Cross references contain important links to other databases at NCBI or external databases. Specific links to other Entrez Databases are below the cross reference section. E. Domain description (CDD), COG category, and BRITE hierarchy (KEGG) provide more general information for a protein cluster. F. Publications are either curated, or collected automatically for proteins (RefSeq, SwissProt, structures, via sequence similarity), domains (CDD), and genes (GeneRIFs) and displayed by category in a semi-collapsed (only one publication per category) state which can be expanded to the full list, or the full list of publications can be displayed in PubMed. G. Top pattern links to the cluster pattern display.
The display (overview) for each cluster provides information on cluster accession, cluster name, and gene name, as well as links to protein display tools, external databases, and publications. Each of these sections can be expanded or collapsed by clicking on the down or right arrows, respectively (Figure 1).
Curation may involve joining or splitting clusters, or simply adding annotation information. Annotation includes the protein name, gene name and synonyms, a brief description, Enzyme Commission number, and curated publication links.
Automatically collected information is available for both curated and noncurated clusters and is displayed in the Cross Reference section and other sections of the overview. Domain information is collected from the NCBI Conserved Domain Database (CDD), and functional classification is collected from the Clusters of Orthologous Groups (COG) and the Kyoto Encyclopedia of Genes and Genomes orthologous clusters (KEGG KO). Pathway and hierarchy information is also obtained from KEGG (BRITE hierarchy). Publication links are automatically collected and shown in the Publication Links section.
The information highlighted in blue at the top of the page contains the core curated identifiers and information (Figure 1A). The cluster accession and the curation status are at the left, the name of the cluster/protein in the middle, and the gene name or synonym (if annotated) at the right.
Below the core curated information is a description that contains curated information on the proteins and their function, domain descriptions, COG functional categories, and KEGG BRITE hierarchy showing functional classification (Figure 1).
Cluster Info
Cluster Info (Figure 1B) displays information specific to the cluster. On the left-hand side below the blue bar are statistics about the protein cluster:
- unique identifier
- total proteins in the cluster
- common taxonomic node
- total number of genera
- total number of organisms
- putative number of paralogs (two or more proteins encoded by genes on the same nucleotide)
- total number of publications
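The paralog definition in the list above (two or more cluster members encoded on the same nucleotide sequence) can be expressed as a short count. This is a sketch under the assumption that every protein sharing a nucleotide with another cluster member is counted; the function name and the input layout are illustrative, and the database may tally paralogs differently.

```python
from collections import Counter


def count_putative_paralogs(members):
    """Count cluster members that share a nucleotide sequence with another member.

    members: iterable of (protein_accession, nucleotide_accession) pairs.
    """
    per_nucleotide = Counter(nucleotide for _, nucleotide in members)
    # A nucleotide carrying two or more members contributes all of them.
    return sum(count for count in per_nucleotide.values() if count >= 2)
```

Called on three members where two share one nucleotide accession and the third sits on another, the function returns 2.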
Cross References
Cross references (Figure 1D) are calculated at the level of each protein (e.g., domain assignment for a particular protein), collected from all proteins in a given cluster, and finally displayed in the cross-reference section, which provides links to similar clusters, information on protein families, metabolic roles, conserved protein domains, and protein structure. External links to these databases are also provided on the home page for Entrez Protein Clusters.
Entrez Links
Entrez links (Figure 1) provide links to the Entrez Gene, Genome, Nucleotide, Protein, and PubMed records for the proteins in the cluster. Gene and Protein link directly to the genes and proteins that make up the protein cluster. Genome and Nucleotide link to the chromosomes, plasmids, and organellar genomes that encode these proteins. PubMed links to publications associated with that cluster. Further explanation of Entrez Links is available in the Querying and Searching section.
Publication Links
Publication links (Figure 1F) contain publications added during curation as well as publications collected from individual proteins, genes, structures, and domains. The sources include RefSeq genes and proteins (direct link to protein cluster members), GenBank, SwissProt, structural proteins, and Conserved Domains (via sequence similarity). A publication can appear in more than one category, but the total shown at the top counts the complete non-redundant set. The title of the most recent publication in each category is shown, along with links for each category. The entire publication set can be expanded to show all publication titles.
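The relationship between the per-category lists and the non-redundant total can be sketched as a set union. The input layout (category name mapped to `(pubmed_id, year, title)` tuples) and the function name are assumptions made for this example only.

```python
def summarize_publications(by_category):
    """Return (non-redundant total, most recent title per category).

    by_category: dict mapping category name -> list of (pubmed_id, year, title).
    """
    unique_ids = set()
    latest_title = {}
    for category, publications in by_category.items():
        # The same PubMed ID may appear under several categories; count it once.
        unique_ids.update(pubmed_id for pubmed_id, _, _ in publications)
        if publications:
            latest_title[category] = max(publications, key=lambda pub: pub[1])[2]
    return len(unique_ids), latest_title
```

With one publication shared between two categories, the total is smaller than the sum of the category sizes, which matches the behavior described above.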
Cluster Patterns
Cluster Patterns are contiguous sets of clusters, curated or noncurated, conserved across multiple nucleotide sequences. The top pattern (the one containing the maximal protein count, limited to 13 clusters for display purposes) is shown (Figure 1G) and links to the Cluster Pattern display page.
Protein Table
Figure 2. Protein table. A protein table from the main display page of a representative protein cluster (PRK09525 – beta-galactosidase). A. Column headings and controls. The headings from left to right are organism, protein name, previous cluster, accession, next cluster, locus_tag, Blink, and alignment. The controls under the organism include collapse (all organism groups will be collapsed), highlight paralogs (two or more proteins encoded by the same nucleotide sequence are considered paralogs) will highlight those with a yellow color (if any are present), and limit to paralogs will limit the entire table to only paralogs (if any). B. List of all proteins in the table grouped by organism group and shown alphabetically. Each row consists of one protein, and alternating rows are shaded for visualization. Each organism group can be collapsed (those with +/- beside them) individually or with the control in the top bar. Organisms can be selected as entire groups, or individually. Checking an organism will result in the selections being highlighted in some of the cluster tools (alignment, tree, ProtMap). Each column is a link to another Entrez database or tool: organism (taxonomy), protein name and accession (protein), and locus_tag (Gene). The diamond in the Blink column is a direct link to pre-computed protein-protein BLAST results (Blink) if any are available. The previous and next cluster links (name available on mouseover) represent the local genomic neighborhood for each nucleotide sequence. Genes in the 5’ or 3’ direction that encode proteins that belong to a cluster are shown color-coded according to COG functional category, with the cluster accession shown. Non-colored clusters indicate no COG functional category, and a blank space indicates a gene that either does not encode a protein belonging to a cluster, or that does not encode a protein at all (an RNA for example). The full genomic neighborhood is available in the cluster tools section (ProtMap). 
The alignment column is a simplified view of the full alignment, showing aligned residues as gray blocks and gaps as blank spaces. The conserved domain assignment is shown in the alignment view as colored bars below each sequence. Each colored bar is a different domain, and the name is displayed on mouseover. Clicking on the domain will open that particular domain in CDD. The list of domains corresponds to those found in the cross reference section. If all proteins have identical domain assignments, then only the domain assignment on the top protein will be shown. All protein-domain assignments can be displayed with the expand/collapse control on the first protein. Protein sequences that are 100% identical will be boxed in the alignment column. Clicking anywhere in the alignment will open the alignment viewer at the position clicked.
The protein table (Figure 2) contains information on each of the proteins in the cluster, organized by taxon. Where there are multiple species or strains from the same genus, the table can be collapsed or expanded. Selected sequences or groups at higher taxonomic levels can be highlighted in the alignment or distance tree by checking the box next to the genus or species.
The organism, protein name, accession, locus tag, and length are provided. To provide information on genome neighborhood, genes upstream and downstream of the gene encoding the protein are checked for cluster assignment. If these genes belong to a cluster, they are displayed with their cluster assignment and are color-coded by COG functional category (genes that do not belong to a cluster are not shown). Blink provides pre-computed BLAST results. Alignment provides a graphic of the alignment, including domain structure. Each of these properties provides a link to the appropriate database or tool.
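The previous and next cluster columns can be thought of as a lookup on the ordered list of genes for one nucleotide sequence. The following is a minimal sketch, assuming genes are supplied in genomic order as `(locus_tag, cluster_accession_or_None)` pairs; it ignores strand and circular replicons, and the function name and sample locus tags are hypothetical.

```python
def neighbor_clusters(genes, locus_tag):
    """Return (previous_cluster, next_cluster) for the gene with the given locus tag.

    genes: list of (locus_tag, cluster_accession or None) in genomic order.
    None stands for a neighbor without a cluster assignment (a blank cell),
    or for a missing neighbor at either end of the sequence.
    """
    tags = [tag for tag, _ in genes]
    index = tags.index(locus_tag)
    previous_cluster = genes[index - 1][1] if index > 0 else None
    next_cluster = genes[index + 1][1] if index + 1 < len(genes) else None
    return previous_cluster, next_cluster


# Hypothetical locus tags; the cluster accessions follow the documented format.
example_genes = [
    ("g1", "PRK00001"),
    ("g2", None),
    ("g3", "PRK09525"),
    ("g4", "CLS12345"),
]
```

Here `neighbor_clusters(example_genes, "g3")` gives `(None, "CLS12345")`, because the upstream neighbor `g2` has no cluster assignment.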
Querying and Searching
There are numerous ways to query protein clusters, either with search terms in Entrez or with a protein or nucleotide sequence using either CD-Search or Concise Protein BLAST.
Entrez Search
The Entrez Protein Clusters database uses all of the features of other Entrez databases. You can limit searches, Preview/Index your search terms, and use the History, Clipboard, or Details by using the tabs underneath the search box. General instructions on Entrez querying can be found in the Entrez help document.
As noted in the Entrez Links section, protein clusters currently link to Gene, Protein, Nucleotide, Genome, and PubMed. Links from these Entrez databases to protein clusters are also available via the links menu (Links in Entrez Gene). All protein clusters associated with a given genome can be found from the link to protein clusters from Entrez Genome or Nucleotide. Links to protein clusters for a specific organism can also be found in the protein view of Entrez Genome. Since many publication links are calculated by sequence similarity to proteins, related proteins not found in protein clusters can be retrieved this way.
Curated protein clusters are mirrored in the Conserved Domain Database (after a delay in data synchronization) and can be found using the accession (e.g., PRK00001) or a text search for the curated name (note that this will also find related domains). The full text of the descriptions is not available for search in CDD for protein clusters.
Limits
Figure 7. Protein clusters limits page. The page shows the options available to limit a search of protein clusters. A. Drop-down menu for selecting the specific field to limit, as described in the table. B. A series of checkboxes can be used to limit searches by status type, source, or taxonomy.
The Limits tab on the search bar allows search limits to be set from a drop-down menu (Figure 7A).
After selecting a limit, the selected field will show up in the yellow bar behind the Field tag. Searches can also be refined by checking the desired box(es) in the table to limit by curation status, source, or taxonomy. The Limits checkbox will also be marked and will remain through subsequent searches. To remove the limits for a particular search, deselect the checkbox.
The following table summarizes the various limits and properties that can be used to refine searches.
| Field name | Definition [including field abbreviations] | Examples |
|---|---|---|
| Accession | Unique identifier for each cluster. [ACCN][ACCESSION] | Retrieve the cluster with the accession PRK09525: PRK09525 [ACCN] |
| Average Length | Average length of proteins in the cluster. [Average Length] | Retrieve all clusters with an average protein length of 100–300 amino acids: 100:300[Average Length] |
| COG | COG (Clusters of Orthologous Groups) is a phylogenetic classification of proteins from complete genomes. [COG] | Retrieve all clusters with COG3250: COG3250[COG] |
| Creation date | Date the record was created. The format is YEAR/MONTH/DAY, including the forward slashes. [Creation Date] | Find all clusters created in 2007: 2007[Creation Date] |
| Domain Name | Domains are structural or functional units in a protein; nomenclature is based on the NCBI Conserved Domain Database. [Domain Name] | Retrieve all clusters with an ATP-binding domain: ABC[Domain Name] |
| Domains | Number of domains in the protein cluster. [Domains] | Retrieve all clusters with 3 domains: 3[Domains] |
| EC/RN Number | The number assigned by the Enzyme Commission or Chemical Abstracts Service (CAS) to designate a particular enzyme or chemical, respectively. [EC/RN Number] | Retrieve all clusters containing the EC number 3.2.1.23: 3.2.1.23 [EC/RN Number] |
| Filter | Retrieves clusters with a pre-selected property. [Filter] | Retrieve all curated clusters: Curated[Filter] |
| Gene Name | Abbreviated name for the gene. [Gene Name] | Retrieve all lacZ clusters: lacZ[Gene Name] |
| Gene synonym | Alternative name for the gene found in the database records. [Gene Synonym] | Retrieve all clusters with cbiJ as a gene synonym: cbiJ[Gene Synonym] |
| HAMAP | Number assigned to designate a well-defined and well-conserved protein family or subfamily by the Swiss Institute of Bioinformatics. HAMAP stands for High-quality Automated and Manual Annotation of microbial Proteomes. [HAMAP] | Retrieve all clusters with the HAMAP MF_00008: MF_00008[HAMAP] |
| KO | Number assigned to designate a manually curated set of orthologous gene groups in the complete genomes by the Kyoto Encyclopedia of Genes and Genomes. [KO] | Retrieve all clusters with a KO of k01190: k01190 [KO] |
| Locus Tag | Locus tags are identifiers that are systematically applied to every gene in a genome. [Locus Tag] | Retrieve all clusters with a protein with the locus tag of Z0440: Z0440[Locus Tag] |
| Organism | The scientific and common names for the organisms associated with the protein sequence. [Organism] | Find all clusters associated with Escherichia coli: Escherichia coli[Organism] |
| Paralogs | Number of paralogous proteins in a cluster. [Paralogs] | Retrieve all clusters with 13 paralogs: 13[Paralogs] |
| Properties | An attribute of the cluster based on DNA source or curation status. [PROP][Properties] | Retrieve all clusters from chloroplasts: source chloroplast[Properties] |
| Protein Accession | The unique accession number of the protein. [Protein Accession] | Retrieve all clusters containing the protein accession NP_414878: NP_414878 [Protein Accession] |
| Protein GI number | A series of digits that are assigned consecutively by NCBI to each sequence it processes. [Protein GI] | Retrieve all clusters containing the protein GI of 16128329: 16128329 [Protein GI] |
| Protein Name | The standard name of proteins found in database records. Common names may not be indexed in this field, so it is best to also consider All Fields or Text Words. [PROT][Protein Name] | Retrieve all clusters containing the protein beta galactosidase: beta galactosidase [Protein Name] |
| PubMed ID | Unique identifier for the publication in the PubMed database. [PubMed ID] | Retrieve all clusters with the PubMed ID number 97298: 97298 [PubMed ID] |
| Sequence Length | Exact number of amino acids in the protein sequence. [Sequence Length] | Retrieve all clusters with at least one protein with a length of 1024 amino acids: 1024 [Sequence Length] |
| Size | Number of proteins in the cluster. [Size] | Retrieve all clusters with 25 proteins: 25[Size] |
| Taxonomy ID | Identifier for the species or strain in the NCBI taxonomy database. [Taxonomy ID] | Retrieve all clusters with proteins from taxonomy ID 83332: 83332 [Taxonomy ID] |
| Title | Title of the protein cluster. [Title] | Retrieve all beta-D-galactosidase protein clusters: beta-D-galactosidase [Title] |
| Total Publications | Total number of publications associated with proteins in the cluster. [Total Publications] | Retrieve all clusters with 51 publications: 51[Total Publications] |
Preview/Index
The Preview/Index page on any Entrez database is used to construct queries and to view terms that have been indexed under any field name. The table in the previous section described the fields used in indexing the records and provided some representative queries using those fields. Information on using Preview/Index can be found in the Entrez help documentation.
The History, Clipboard, and Details features are consistent with other Entrez databases. You may find additional information in the Entrez help documentation.
If you have any additional questions, then please send an email to: info@ncbi.nlm.nih.gov
Constructing Powerful Queries
Combining search terms with limits and filters allows one to generate powerful queries. For example:
"Escherichia coli"[Organism] AND curated[filter] AND translation[COG Group]
This search finds all curated clusters that belong to the translation-related COG functional group (COG Group J) and contain RefSeq proteins encoded by genomes from Escherichia coli.
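Queries of this form are plain strings, so they can be assembled from parts. The helper names below (`field_term`, `range_term`, `combine`) are illustrative and not part of any Entrez library; the sketch only builds the text to paste into the search box and does not contact NCBI.

```python
def field_term(value, field):
    """Format one search term, quoting multi-word values as a phrase."""
    value = str(value)
    if " " in value and not value.startswith('"'):
        value = f'"{value}"'
    return f"{value}[{field}]"


def range_term(low, high, field):
    """Format a numeric range using the low:high syntax shown in the limits table."""
    return f"{low}:{high}[{field}]"


def combine(*terms, operator="AND"):
    """Join terms with a Boolean operator (AND, OR, or NOT)."""
    return f" {operator} ".join(terms)


# Rebuild the example query from the text above.
query = combine(
    field_term("Escherichia coli", "Organism"),
    field_term("curated", "filter"),
    field_term("translation", "COG Group"),
)
```

The same helpers cover the range syntax from the limits table: `range_term(100, 300, "Average Length")` produces `100:300[Average Length]`.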
Query with Sequence
Figure 8. Sequence similarity search against protein clusters. A representative chloroplast protein (YP_319743 ribosomal protein S12) was used to search protein clusters by sequence similarity. A. Curated prokaryotic protein clusters are available using the CD-Search tool as one of the databases. Only protein sequences can be used in searches via CD-Search. B. Full results of CD-Search against PRK showing 3 protein clusters, PRK05163 (bacterial ribosomal protein S12), CHL00051 (chloroplast ribosomal protein S12 – which contains YP_319743) and PRK04211 (archaeal ribosomal protein S12P). The top section shows a cartoon view of the alignment. Note the partial alignment against PRK04211. The bottom view has an expanded display of description and alignment information. C. Concise protein BLAST results. As described in the text, a representative protein is chosen from each subcluster at the level of genera and used in this BLAST database in order to speed up results and reduce redundancy. Proteins that are not members of a cluster are also added to this database. Score and E-value are available for the representative proteins or for the nonclustered proteins. The table shows the organism (taxonomy), protein name and accession (protein), locus_tag (gene), and cluster (protein clusters) names and links. Blink (pre-computed BLAST results) and BL2Seq (query and subject) results are available for further analysis. Concise protein BLAST can use both protein and nucleotide input sequences and the results can be reformatted to standard BLAST output.
Currently, there are two ways to query protein clusters with a sequence: CD-Search and Concise Protein BLAST. The two methods draw on different resources and can be used for different purposes (Figure 8).
Note: CD-Search can only be queried using a protein sequence, whereas both protein and nucleotide sequences can be used in Concise Protein BLAST.
CD-Search
Position-specific scoring matrices (PSSMs) are constructed from the alignments of each of the curated clusters. The PSSMs for curated prokaryotic clusters have been added to the CD-Search page and can be searched using RPS-BLAST (Figure 8A; select PRK from the database pull-down menu). More information on how to use CD-Search can be found on the CDD help page. The CD-Search page only allows searches with protein sequence queries. Updates to CD-Search may also lag slightly behind those in Entrez Protein Clusters.
Concise Protein BLAST
The Concise Protein BLAST database consists of all prokaryotic protein clusters (curated and non-curated), as well as nonclustered proteins (Figure 8C). However, the protein clusters are not in their raw form. Instead, each cluster has been sliced at the level of genera to provide “subclusters”. From each subcluster, a single representative protein has been chosen at random and is used in the BLAST database. A single curated cluster, therefore, may comprise many subclusters, each with a representative. This reduces redundancy when using BLAST, which speeds up searches and provides a broader taxonomic view than is typically found in BLAST results. The other proteins in the cluster are automatically linked to this representative and will also be found in the search results, although without a BLAST score and E-value because they are not specifically examined. All proteins that do not belong to the genus-level clusters are also added to the database for completeness.
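The slicing step described above can be sketched as follows. This is an illustration only: the input layout (cluster accession mapped to `(protein_accession, genus)` pairs), the function name, and the fixed random seed are assumptions, and the actual database build may differ.

```python
import random
from collections import defaultdict


def build_concise_entries(cluster_members, seed=0):
    """Pick one random representative per genus-level subcluster.

    cluster_members: dict cluster_accession -> list of (protein_accession, genus).
    Returns a dict (cluster_accession, genus) -> (representative, linked_proteins).
    """
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    entries = {}
    for cluster, members in cluster_members.items():
        # Slice the cluster by genus to form subclusters.
        by_genus = defaultdict(list)
        for protein, genus in members:
            by_genus[genus].append(protein)
        for genus, proteins in by_genus.items():
            representative = rng.choice(proteins)
            # Remaining members are linked to the representative but not searched.
            linked = [p for p in proteins if p != representative]
            entries[(cluster, genus)] = (representative, linked)
    return entries
```

Only the representatives would be searched directly; the linked proteins appear in results through their representative, which is why they carry no score or E-value of their own.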
Query Page
Queries can be either protein or nucleotide using blastp and blastx programs, respectively. Accessions, GIs, or sequences in FASTA format can be entered in the query box.
Default parameters are set below the query box. The expect (E-value) threshold is set low, which will help reduce the number of BLAST results. Information about each parameter is available by clicking on each name.
Results Page
The results page is not the one typically returned for BLAST results, although a link is provided to view the results in standard format (Figure 8C).
The query is shown, along with the length, the number of hits for total proteins, and the proteins represented by the genus-level clusters. A link to each cluster (if it exists) for either curated or non-curated clusters is also provided. Those proteins that do not have a cluster link are singletons that do not exist in clusters (and would not be found in the Entrez Protein Clusters site).
Results are returned in a collapsed table format. Genus-level clusters are represented with a plus (+) sign at each level, which can be expanded. The table is sortable by organism name and by BLAST score. The other proteins in a cluster have neither a BLAST score nor an E-value, because only the one representative protein chosen from each genus-level cluster is searched when a query is submitted.
Release And Data Retrieval
Statistics
Summed statistics for the current protein clusters release and date are available on this webpage:
http://www.ncbi.nlm.nih.gov/genomes/prkstats.html
Note that some of the links in the table are dynamic searches to Entrez databases that may generate different totals than those shown when the search is performed.
FTP Files
After every public protein cluster release (approximately every 3 months), all data derived from the clustering procedure, automated analyses, added information, and curation are released publicly at the same time in Entrez and on the FTP site at this location:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/CLUSTERS/
Cluster releases are currently organized by month and year and subdivided by organism group (PRK, CHL, MTH, etc.). Current information about the directory, file structure, and statistics is in the README document:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/CLUSTERS/README
The files include flatfiles, alignments, and PSSMs for curated clusters, and link lists for protein GI to cluster, taxonomy, cluster ID to PubMed ID, etc. The concise protein BLAST database is also available. Large files are stored as tarballs (*.tgz files).