NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Protein Clusters Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2007-.

Protein Clusters: A Collection of Proteins Grouped by Sequence Similarity and Function

Last Update: January 25, 2010.


The NCBI Entrez Protein Clusters database is a collection of Reference Sequence (RefSeq) proteins from the complete genomes of prokaryotes, plasmids, viruses, and organelles, and from the complete and incomplete genomes of protozoa and plants. The proteins are grouped and annotated based on sequence similarity and protein function (1). Proteins are automatically grouped into clusters based on reciprocal best-hit BLAST scores. The protein clusters database is updated at quarterly intervals, and each update includes validation and quality assessment. Because of the processing time required, there is a 3-month delay before proteins from new genomes are incorporated into protein clusters.

Clusters in the protein clusters database (protclustdb) are named and functional descriptions are assigned by manual curation. Alignments, information on genome neighborhood, and links to NCBI and external databases are provided for each protein cluster. Specific query and search terms can be found under Querying and Searching.

Proteins encoded by prokaryotes, plasmids, viruses, organelles, and complete and incomplete genomes of protozoa and plants are contained within separate cluster groups. Each cluster in the database has a unique identifying number (UID). Each cluster also has an accession number consisting of a three- or four-letter prefix followed by five digits. Accession numbers are stable between releases unless the curated cluster is removed, split into subclusters, or joined with another curated cluster (these changes are not tracked between releases).

| Protein Cluster Database | Cluster Number |
| --- | --- |
| Curated Prokaryotic Protein Clusters | PRK##### |
| Uncurated Prokaryote Protein Clusters | CLSK##### |
| Curated Chloroplast Protein Clusters | CHL##### |
| Uncurated Chloroplast Protein Clusters | CLSC##### |
| Curated Mitochondrial Protein Clusters | MTH##### |
| Uncurated Mitochondrial Protein Clusters | CLSM##### |
| Curated Virus Protein Clusters | PHA##### |
| Uncurated Virus Protein Clusters | CLSP##### |
| Curated Protozoan Protein Clusters | PTZ##### |
| Uncurated Protozoan Protein Clusters | CLSZ##### |
| Curated Plant Protein Clusters | PLN##### |
| Uncurated Plant Protein Clusters | CLSN##### |
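
The prefixes in the table above can be used to tell which group a cluster accession belongs to and whether it is curated. The following is a minimal sketch, not an NCBI tool: the function name `classify_accession` and the group labels are illustrative, and the pattern assumes the five-digit form described in this section.

```python
import re

# Prefix -> (organism group, curated?) as listed in the table above.
PREFIXES = {
    "PRK": ("prokaryote", True),
    "CLSK": ("prokaryote", False),
    "CHL": ("chloroplast", True),
    "CLSC": ("chloroplast", False),
    "MTH": ("mitochondrion", True),
    "CLSM": ("mitochondrion", False),
    "PHA": ("virus", True),
    "CLSP": ("virus", False),
    "PTZ": ("protozoan", True),
    "CLSZ": ("protozoan", False),
    "PLN": ("plant", True),
    "CLSN": ("plant", False),
}

# Assumption: three or four letters followed by exactly five digits.
_ACCESSION = re.compile(r"^([A-Z]{3,4})(\d{5})$")


def classify_accession(accession):
    """Return (group, curated, number) for a cluster accession, or raise ValueError."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None or match.group(1) not in PREFIXES:
        raise ValueError(f"not a recognized cluster accession: {accession!r}")
    group, curated = PREFIXES[match.group(1)]
    return group, curated, int(match.group(2))
```

For example, `classify_accession("PRK09525")` reports a curated prokaryotic cluster, while an unknown prefix raises `ValueError`.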

The clusters are divided into curated and non-curated sets. Non-curated clusters are automatically generated and have not yet been manually annotated. Manual curation involves confirming protein domain structure, joining related clusters, adding publications, and assigning functional annotation. The status of curated clusters is provisional, validated, or reviewed depending upon the level of curation. Provisional status indicates minimal curation, validated status indicates a moderate level of curation, and reviewed status is applied to clusters after more extensive review. Validated and reviewed curated clusters have consistent nomenclature and protein function descriptions and are used in the NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) as well as in cleanup of RefSeq records.

Cluster Creation

RefSeq proteins from all complete prokaryotic genomes and plasmids (or, separately, from organelles, viruses, plants, or protozoa) are compared using all-against-all BLAST. Protein clusters are created using a modified BLAST score that takes into account the length of the hit (alignment) relative to both the query and the subject. The modified score is then used to sort the BLAST results numerically, and all proteins that are contained within the top hits are clustered together. Note: this procedure is likely to be modified in the future.
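
The exact form of the modified score is not given here, so the following is only an illustrative sketch of the general idea. It assumes a hypothetical score that multiplies the bitscore by the fraction of the query and of the subject covered by the alignment, and it groups proteins that are each other's best hit under that score. The function names and the scoring formula are assumptions, not the database's actual procedure.

```python
from collections import defaultdict


def modified_score(bitscore, aln_len, query_len, subject_len):
    # ASSUMPTION: scale the bitscore by alignment coverage of both sequences.
    return bitscore * (aln_len / query_len) * (aln_len / subject_len)


def cluster_by_reciprocal_best_hits(hits, lengths):
    """hits: iterable of (query, subject, bitscore, aln_len); lengths: id -> length."""
    best = {}  # query -> (modified score, subject)
    for query, subject, bitscore, aln_len in hits:
        if query == subject:
            continue  # ignore self hits
        score = modified_score(bitscore, aln_len, lengths[query], lengths[subject])
        if query not in best or score > best[query][0]:
            best[query] = (score, subject)

    # Union-find over proteins; join only reciprocal best hits.
    parent = {protein: protein for protein in lengths}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, (_, subject) in best.items():
        if best.get(subject, (None, None))[1] == query:
            parent[find(query)] = find(subject)

    groups = defaultdict(set)
    for protein in lengths:
        groups[find(protein)].add(protein)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])
```

Proteins without a reciprocal partner come back as single-member groups, so short partial hits do not pull an unrelated protein into a cluster.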

From the set of raw clusters, curators join and split clusters and add annotation information (protein name, gene name, description) and publication links. The cluster annotation is occasionally propagated to every RefSeq protein that belongs to a specific cluster in order to generate more consistent protein and gene annotation on RefSeq records.

Entrez Protein Cluster Overview

The display (overview) for each cluster provides information on cluster accession, cluster name, and gene name, as well as links to protein display tools, external databases, and publications. Each of these sections can be expanded or collapsed by clicking on the down or right arrows, respectively (Figure 1).

Figure 1. Protein cluster overview. The main page for the protein clusters showing a representative protein cluster (PRK09525 - beta-galactosidase). Certain sections are collapsible for compact viewing, or expandable for a more detailed display (blue arrows).

Curation may involve joining or splitting clusters, or simply adding annotation information. Annotation includes the protein name, gene name and synonyms, a brief description, Enzyme Commission number, and curated publication links.

Automatically collected information is available for both curated and noncurated clusters and is displayed in the Cross Reference section and other sections of the overview. Domain information is collected from the NCBI Conserved Domain Database (CDD), and functional classification is collected from Clusters of Orthologous Groups (COG) and from Kyoto Encyclopedia of Genes and Genomes orthologous clusters (KEGG KO). Pathway and hierarchy information is also obtained from KEGG (BRITE hierarchy). Publication links are automatically collected and shown in the Publication Links section.

The information highlighted in blue at the top of the page contains the core curated identifiers and information (Figure 1A). The cluster accession and the curation status are at the left, the name of the cluster/protein in the middle, and the gene name or synonym (if annotated) at the right.

Below the core curated information is a description that contains curated information on the proteins and their function, domain descriptions, COG functional categories, and KEGG BRITE hierarchy showing functional classification (Figure 1E).

Cluster Info

Cluster Info (Figure 1B) displays information specific to the cluster. On the left-hand side below the blue bar are some specific statistics about the protein cluster:

  • unique identifier
  • total proteins in the cluster
  • common taxonomic node
  • total number of genera
  • total number of organisms
  • putative number of paralogs (two or more proteins encoded by genes on the same nucleotide sequence)
  • total number of publications
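
As a rough illustration of how the counts in this list relate to the member proteins, the sketch below tallies them from a list of records. The record keys (`genus`, `organism`, `nucleotide`) are hypothetical, and the paralog count uses one possible reading of the definition above: every protein that shares a nucleotide sequence with at least one other member of the cluster.

```python
from collections import Counter


def cluster_statistics(members):
    """members: list of dicts with the (hypothetical) keys 'genus', 'organism', 'nucleotide'."""
    per_nucleotide = Counter(m["nucleotide"] for m in members)
    # ASSUMPTION: proteins count as putative paralogs when two or more
    # cluster members are encoded on the same nucleotide sequence.
    paralogs = sum(n for n in per_nucleotide.values() if n >= 2)
    return {
        "total_proteins": len(members),
        "genera": len({m["genus"] for m in members}),
        "organisms": len({m["organism"] for m in members}),
        "putative_paralogs": paralogs,
    }
```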

Cross References

Cross references (Figure 1D) are calculated at the level of each protein (e.g., the domain assignment for a particular protein), collected from all proteins in a given cluster, and then displayed in the Cross References section. This section provides links to similar clusters and information on protein families, metabolic roles, conserved protein domains, and protein structure. External links to these databases are also provided on the home page for Entrez Protein Clusters.

Related Clusters

Related Clusters (protein clusters related by sequence similarity) are shown in a line below the publication links, ordered from left to right from most to least similar. Similar clusters are calculated from the complete set of all clusters from all organisms and taxonomic groups, including curated and noncurated clusters.

The ranking of related clusters is calculated from the average BLAST score (average bitscore). The average score is the sum of the scores for all protein pairs divided by the total number of pairs between the full sets of proteins in the two clusters. Related clusters with the highest score are shown on the left. The top ten related clusters (ranked by score) are displayed, if available. Hovering over a cluster displays its curated name, and clicking the link opens the individual cluster. All related clusters are accessible from the link that displays the total number of related clusters. Clusters are color-coded by COG functional category.
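
The average-score ranking can be sketched in a few lines. This is an illustration under stated assumptions rather than the database's code: bitscores are supplied as a dictionary keyed by protein pair, pairs with no BLAST hit are assumed to contribute zero, and the function names are hypothetical.

```python
def average_bitscore(cluster_a, cluster_b, bitscores):
    """bitscores: dict mapping (protein_a, protein_b) -> bitscore."""
    total_pairs = len(cluster_a) * len(cluster_b)
    if total_pairs == 0:
        return 0.0
    # ASSUMPTION: pairs without a hit add nothing to the sum.
    total = sum(bitscores.get((a, b), 0.0) for a in cluster_a for b in cluster_b)
    return total / total_pairs


def top_related(target, others, bitscores, limit=10):
    """others: dict of cluster name -> member list. Returns [(name, score)], best first."""
    ranked = sorted(
        ((average_bitscore(target, members, bitscores), name)
         for name, members in others.items()),
        reverse=True,
    )
    return [(name, score) for score, name in ranked if score > 0][:limit]
```

Dividing by all possible pairs, not only the pairs that produced a hit, means a single strong hit between two large clusters yields a low average.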

Cluster Patterns

Cluster Patterns are contiguous sets of clusters, curated or noncurated, conserved across multiple nucleotide sequences. The top pattern (the one containing the maximal protein count; at most 13 clusters of the pattern are displayed) is shown (Figure 1G) and links to the Cluster Pattern display page.

Protein Table

The protein table (Figure 2) contains information on each of the proteins in the cluster, organized by taxon. Where there are multiple species or strains from the same genus, the table can be collapsed or expanded. Selected sequences or groups at higher taxonomic levels can be highlighted in the alignment or distance tree by checking the box next to the genus or species.

Figure 2. Protein table. A protein table from the main display page of a representative protein cluster (PRK09525 - beta-galactosidase). A. Column headings and controls. The headings from left to right are organism, protein name, previous cluster, accession, ...

The organism, protein name, accession, locus tag, and length are provided, as is a link to the SwissProt accession. To provide information on genome neighborhood, genes upstream and downstream of the gene encoding the protein are checked for cluster assignment. If these genes belong to a cluster, they are displayed with their cluster assignment and are color-coded by COG functional category (genes that do not belong to a cluster are not shown). BLink shows precomputed BLAST results. Alignment provides a graphic of the alignment, including domain structure. Each of these properties provides a link to the appropriate database or tool.

Cluster Tools

Cluster Tools (Figure 1C) contains several methods of displaying cluster details or more detailed analysis including multiple alignments, phylogenetic trees, and local genomic neighborhood (ProtMap and cluster patterns).

Sequence And Phylogenetic Analysis

Show Detailed Alignment

Show detailed alignment displays a scrollable multiple sequence alignment (Figure 3). Selected sequences in the alignment may be highlighted by checking the box next to the species or genus on the protein table (Figure 2B) before displaying the alignment. Amino acids in the alignment display can be shown either as a consensus (only amino acids that differ from the consensus sequence are shown; Figure 3C) or by conservation (conserved amino acids highlighted by color; Figure 3B). The amino acid coloration scheme is based on the physical and/or chemical properties of the amino acids. Similarities are highlighted if at least 80% of residues in a column are identical or fall into at least one of the following amino acid groups: aromatic (FHWY); aliphatic (ILVA); hydrophobic (ACFILMVWY); alcohol (STC); charged (DEHKR); positive (HKR); negative (DE); polar (CDEHKNQRST); tiny (AGS); small (ACDGNPSTV); or bulky (EFIKLMQRWY). Features are shown in the alignment for critical residues obtained from CDD (via structure). Conserved Domains can be displayed by clicking Show/hide all in the Domains box or, for any individual sequence, by clicking the + or − next to the protein accession.
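
The 80% rule can be illustrated with a small check on a single alignment column. This is a sketch of the rule as described, not the viewer's implementation: the function name is hypothetical, gap characters are assumed to count against the threshold, and when several groups qualify the first one in the listed order is reported.

```python
# Amino acid groups as listed above (F, H, W, Y have aromatic side chains).
GROUPS = {
    "aromatic": "FHWY",
    "aliphatic": "ILVA",
    "hydrophobic": "ACFILMVWY",
    "alcohol": "STC",
    "charged": "DEHKR",
    "positive": "HKR",
    "negative": "DE",
    "polar": "CDEHKNQRST",
    "tiny": "AGS",
    "small": "ACDGNPSTV",
    "bulky": "EFIKLMQRWY",
}


def column_similarity(column, threshold=0.8):
    """Return the conserved residue, the first matching group name, or None."""
    residues = [c.upper() for c in column]
    n = len(residues)
    if n == 0:
        return None
    # Identity: one residue makes up at least `threshold` of the column.
    # ASSUMPTION: gaps ('-') stay in the denominator but never match.
    for aa in set(residues) - {"-"}:
        if residues.count(aa) / n >= threshold:
            return aa
    # Group membership, tested in the order the groups are listed.
    for name, members in GROUPS.items():
        if sum(r in members for r in residues) / n >= threshold:
            return name
    return None
```

A column such as `ILVAK` is reported as aliphatic because four of its five residues fall in the ILVA group.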

Figure 3. Detailed alignment. Aligned proteins in a representative cluster (PRK13343 - F0F1 ATP synthase subunit alpha). A. The cluster accession and presentation mode are in the top row, while the second row shows cluster name, number of proteins, number ...

Build Tree

Build tree produces a distance tree based on the multiple alignment (Figure 4). The tree construction method can be chosen by using the drop-down menus. Three distance measures are available for calculating pairwise protein distances: an mPam amino acid substitution matrix, which reflects the evolutionary bias of amino acid substitutions; the Kimura approximation for protein sequences, as implemented in the Flu database; and a Poisson model of amino acid substitutions, which assumes an equal rate of substitution between amino acids and among lineages (2). Tree construction uses either Fast Minimum Evolution, which estimates the length of any given topology and then selects the topology with the shortest length, or Neighbor-Joining. In Neighbor-Joining, at each step the pair of nodes $i$, $j$ with the smallest value of $D_{ij} - b_i - b_j$ is chosen, where $D_{ij}$ is the distance between nodes $i$ and $j$, $n$ is the current number of nodes, and $b_i = \sum_{k} D_{ik} / (n-2)$. The distance between the new node $u$ and each remaining node $k$ is defined as $D_{uk} = (D_{ik} + D_{jk} - D_{ij})/2$. Branch lengths are defined as $v_{ui} = (D_{ij} + b_i - b_j)/2$ and $v_{uj} = (D_{ij} + b_j - b_i)/2$ (negative lengths are truncated to zero).
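
A single Neighbor-Joining step, following the quantities defined in the paragraph above (with b_i taken as the sum of distances from node i to all other nodes, divided by n - 2), might look like the sketch below. It is a plain illustration of the textbook step, not the tool's implementation; the function name and the data layout (distances keyed by `frozenset` pairs) are choices made for this example.

```python
def neighbor_joining_step(dist, nodes):
    """Join one pair. dist: {frozenset({i, j}): distance}; nodes: list of labels, n >= 3."""
    n = len(nodes)

    def d(i, j):
        return dist[frozenset((i, j))]

    # b_i = sum over k of D_ik, divided by (n - 2)
    b = {i: sum(d(i, k) for k in nodes if k != i) / (n - 2) for i in nodes}

    # Choose the pair with the smallest D_ij - b_i - b_j (first pair wins ties).
    pairs = ((x, y) for a, x in enumerate(nodes) for y in nodes[a + 1:])
    i, j = min(pairs, key=lambda p: d(*p) - b[p[0]] - b[p[1]])

    # Branch lengths to the new node u; negative lengths are truncated to zero.
    v_ui = max(0.0, (d(i, j) + b[i] - b[j]) / 2)
    v_uj = max(0.0, (d(i, j) + b[j] - b[i]) / 2)

    u = (i, j)  # label the new internal node by the pair it joins
    new_dist = {key: v for key, v in dist.items() if i not in key and j not in key}
    for k in nodes:
        if k not in (i, j):
            # D_uk = (D_ik + D_jk - D_ij) / 2
            new_dist[frozenset((u, k))] = (d(i, k) + d(j, k) - d(i, j)) / 2
    new_nodes = [k for k in nodes if k not in (i, j)] + [u]
    return u, (v_ui, v_uj), new_nodes, new_dist
```

Calling the function repeatedly on its own output until two nodes remain builds the whole tree; each call returns the joined pair and its two branch lengths.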

Figure 4. Phylogenetic tree. Tree visualization of a representative cluster (PRK05159 - aspartyl-tRNA synthetase) created using Fast Minimum Evolution and Kimura distances. The tree has been collapsed at the level of genus, and some manipulations have been ...

Genomic Neighborhood

Two tools allow an analysis of the genomic neighborhood: ProtMap and cluster patterns (Figure 1).

Genome ProtMap

Genome ProtMap (Figure 5) displays a 10-kb region in both directions surrounding all of the proteins in the cluster (Genome ProtMap by cluster accession) or all of the proteins that have the same Cluster of Orthologous Groups, COG (Genome ProtMap by COG#####). Although the ProtMap may appear to be an alignment, it is not. Instead, all of the genes encoding the proteins that are members of the cluster are drawn in the same 5' to 3' direction in the center of the display. Genome ProtMap is not available for plants or protozoa.
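
Drawing every member gene in the same 5' to 3' direction amounts to flipping the coordinates of neighborhoods whose central gene lies on the reverse strand. The sketch below shows one way to do that; it assumes the 10 kb applies on each side of the central gene, and the record keys (`name`, `start`, `end`, `strand`) and function name are hypothetical.

```python
def protmap_window(anchor, neighbors, flank=10_000):
    """anchor and neighbors are dicts with 'name', 'start', 'end', 'strand' (+1 or -1).

    Returns (name, rel_start, rel_end, rel_strand) tuples, sorted left to right,
    with the anchor gene starting at 0 and pointing 5' to 3'.
    """
    lo, hi = anchor["start"] - flank, anchor["end"] + flank
    placed = []
    for gene in [anchor] + list(neighbors):
        if gene["end"] < lo or gene["start"] > hi:
            continue  # outside the displayed region
        if anchor["strand"] == 1:
            rel_start = gene["start"] - anchor["start"]
            rel_end = gene["end"] - anchor["start"]
        else:
            # Reverse-strand anchor: mirror coordinates around the anchor's end.
            rel_start = anchor["end"] - gene["end"]
            rel_end = anchor["end"] - gene["start"]
        placed.append((gene["name"], rel_start, rel_end, gene["strand"] * anchor["strand"]))
    return sorted(placed, key=lambda item: item[1])
```

Because each neighborhood is re-oriented independently, the rows of the display line up on the central gene without being a sequence alignment.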

Figure 5. ProtMap. A representative protein cluster is displayed with the ProtMap tool (PRK05159 - aspartyl-tRNA synthetase, the same cluster shown in Figure 4) showing local genomic neighborhood centered on a set of related genes (via the proteins encoded ...

Cluster Patterns

Cluster Patterns (Figure 6) are contiguous sets of clusters, curated or noncurated, conserved across multiple nucleotide sequences. The Cluster Pattern display shows a taxonomically compressed version of that shown in ProtMap but with a potentially broader set of clusters along a set of nucleotide sequences. The patterns are derived in the following manner.

Figure 6. Cluster pattern. A representative cluster pattern for PRK13024 (bifunctional SecD/SecF). A. The top bar shows the cluster number, curated name, number of proteins, and a download option for a tab-delimited table. The same two links that are available ...

  • Each pattern must have more than 3 clusters
  • Each pattern must have more than 2 proteins
  • Each protein belongs to only a single pattern
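
The three rules can be expressed as a simple check on candidate patterns. The sketch below only validates patterns that have already been proposed; it does not reproduce how the database derives them, and the data layout and function name are assumptions made for the example.

```python
def check_patterns(patterns):
    """patterns: dict of name -> {'clusters': [...], 'proteins': [...]}.

    Returns a list of human-readable problems (empty when all rules hold).
    """
    problems = []
    owner = {}  # protein -> first pattern that contains it
    for name, pattern in patterns.items():
        if len(set(pattern["clusters"])) <= 3:
            problems.append(f"{name}: needs more than 3 clusters")
        if len(set(pattern["proteins"])) <= 2:
            problems.append(f"{name}: needs more than 2 proteins")
        for protein in pattern["proteins"]:
            if owner.setdefault(protein, name) != name:
                problems.append(f"{protein}: in both {owner[protein]} and {name}")
    return problems
```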


If six genomes contain genes that encode proteins belonging to clusters in the following manner (A, B, and C are taxonomic nodes, genes are genes along the chromosome, and CL# are cluster numbers associated with each protein encoded by the genes):
Image helpcluster-Image001.jpg

Then the resulting patterns will be:
Image helpcluster-Image002.jpg

Patterns are collected within a window of at most 80,000 bp centered on the gene encoding the protein that is a member of the cluster. Each pattern contains the maximum number of proteins; that is, patterns are maximized for conservation rather than for length along the genome, even if the final pattern is smaller as a result. Currently, this procedure may produce non-optimal solutions, and some clusters may not have sufficient conservation to generate any patterns at all.

The top pattern (containing the maximal protein count – maximum 13 clusters in the pattern for display purposes) is shown above the protein table on the overview page (Figure 1G). All of the patterns from a given cluster are displayed on the Cluster Pattern tool display.

Querying and Searching

There are numerous ways to query protein clusters, either with search terms in Entrez or with a protein or nucleotide sequence using either CD-Search or Concise Protein BLAST.


The Limits tab on the search bar allows search limits to be set from a drop-down menu (Figure 7).

Figure 7. Protein clusters limits page. For searching the protein clusters showing the options to limit a search. A. Drop down menu for specific field to limit as described in the table. B. A series of checkboxes can be used to limit searches to status type, source, ...

After selecting a limit, the selected field will show up in the yellow bar behind the Field tag. Searches can also be refined by checking the desired box(es) in the table to limit by curation status, source, or taxonomy. The Limits checkbox will also be marked and will remain through subsequent searches. To remove the limits for a particular search, deselect the checkbox.

The following table summarizes the various limits and properties that can be used to refine searches.

| Field name | Definition [including field abbreviations] | Examples |
| --- | --- | --- |
| Accession | Unique identifier for each cluster. | Retrieve cluster with the accession PRK09525: PRK09525[ACCN] |
| Average Length | Average length of proteins in the cluster. [Average Length] | Retrieve all clusters with an average protein length of 100 to 300 amino acids: 100:300[Average Length] |
| COG | COG (Clusters of Orthologous Groups) is a phylogenetic classification of proteins from complete genomes. | Retrieve all clusters with COG3250: |
| Creation date | Date the record was created. Note that the format is YEAR/MONTH/DAY, including the forward slashes. [Creation Date] | Find all clusters created in 2007: 2007[Creation Date] |
| Domain Name | Domains are structural or functional units in a protein; nomenclature is based on the NCBI Conserved Domain Database. [Domain Name] | Retrieve all clusters with an ATP-binding domain: ABC[Domain Name] |
| Domains | Number of domains in the protein cluster. | Retrieve all clusters with 3 domains: |
| EC/RN Number | The number assigned by the Enzyme Commission or Chemical Abstracts Service (CAS) to designate a particular enzyme or chemical, respectively. [EC/RN Number] | Retrieve all clusters containing the EC number [EC/RN Number] |
| Filter | Retrieves clusters with a pre-selected property. | Retrieve all curated clusters: |
| Gene Name | Abbreviated name for the gene. [Gene Name] | Retrieve all lacZ clusters: lacZ[Gene Name] |
| Gene synonym | Alternative name for the gene found in the database records. [Gene Synonym] | Retrieve all clusters with cbiJ as a gene synonym: cbiJ[Gene Synonym] |
| HAMAP | Number assigned to designate a well-defined and well-conserved protein family or subfamily by the Swiss Institute of Bioinformatics. HAMAP stands for High-quality Automated and Manual Annotation of microbial Proteomes. | Retrieve all clusters with the HAMAP MF_00008: |
| KO | Number assigned to designate a manually curated set of orthologous gene groups in the complete genomes by the Kyoto Encyclopedia of Genes and Genomes. | Retrieve all clusters with a KO of k01190: k01190[KO] |
| Locus Tag | Locus tags are identifiers that are systematically applied to every gene in a genome. [Locus Tag] | Retrieve all clusters with a protein with the locus tag Z0440: Z0440[Locus Tag] |
| Organism | The scientific and common names for the organisms associated with the protein sequence. | Find all clusters associated with Escherichia coli: Escherichia coli[Organism] |
| Paralogs | Number of paralog proteins in a cluster. | Retrieve all clusters with 13 paralogs: |
| Properties | An attribute of the cluster based on DNA source or curation status. | Retrieve all clusters from chloroplasts: source chloroplast[Properties] |
| Protein Accession | The unique accession number of the protein. [Protein Accession] | Retrieve all clusters containing the protein accession NP_414878: NP_414878[Protein Accession] |
| Protein GI number | A series of digits that are assigned consecutively by NCBI to each sequence it processes. [Protein GI] | Retrieve all clusters containing the protein GI 16128329: 16128329[Protein GI] |
| Protein Name | The standard name of proteins found in database records. Common names may not be indexed in this field, so it is best to also consider All Fields or Text Words. [PROT] [Protein Name] | Retrieve all clusters containing the protein beta galactosidase: beta galactosidase[Protein Name] |
| PubMed ID | Unique identifier for the publication in the PubMed database. [PubMed ID] | Retrieve all clusters with the PubMed ID number 97298: 97298[PubMed ID] |
| Sequence Length | Exact number of amino acids in the protein sequence. [Sequence Length] | Retrieve all clusters with at least one protein with a length of 1024 amino acids: 1024[Sequence Length] |
| Size | Number of proteins in the cluster. | Retrieve all clusters with 25 proteins: |
| Taxonomy ID | Identifier for the species or strain in the NCBI taxonomy database. [Taxonomy ID] | Retrieve all clusters with proteins from taxonomy ID 83332: 83332[Taxonomy ID] |
| Title | Title of the protein cluster. | Retrieve all beta-D-galactosidase protein clusters: beta-D-galactosidase[Title] |
| Total Publications | Total number of publications associated with proteins in the cluster. [Total Publications] | Retrieve all clusters with 51 publications: 51[Total Publications] |
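
The example queries in the table all follow the same `value[Field]` shape, with `low:high[Field]` for numeric ranges. A small helper can build such terms consistently; this is a convenience sketch only, and the function names are hypothetical rather than part of any NCBI library.

```python
def field_term(value, field, phrase=False):
    """Build a single term such as PRK09525[ACCN]; quote the value when phrase=True."""
    text = str(value)
    if phrase:
        text = f'"{text}"'
    return f"{text}[{field}]"


def range_term(low, high, field):
    """Build a range term such as 100:300[Average Length]."""
    return f"{low}:{high}[{field}]"
```

For instance, `range_term(100, 300, "Average Length")` produces the string used in the Average Length row of the table.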


The Preview/Index page on any Entrez database is used to construct queries and to view terms that have been indexed under any field name. The table in the previous section described the fields used in indexing the records and provided some representative queries using those fields. Information on using Preview/Index can be found in the Entrez help documentation.

The History, Clipboard, and Details features are consistent with other Entrez databases. You may find additional information in the Entrez help documentation.

If you have any additional questions, please send an email to: info@ncbi.nlm.nih.gov

Constructing Powerful Queries

Combining search terms with limits and filters allows one to generate powerful queries. For example:

"Escherichia coli"[Organism] AND curated[filter] AND translation[COG Group]

This search finds all curated clusters that belong to the translation-related COG functional group (COG Group J) and contain RefSeq proteins encoded by Escherichia coli genomes.
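
When such a query is assembled in a script, for example to paste into a URL, the terms can be joined with the Boolean operator and then percent-encoded. The sketch below uses only the Python standard library; the helper name `all_of` is hypothetical.

```python
from urllib.parse import quote_plus


def all_of(*terms):
    """Join search terms with AND, as in the example query above."""
    return " AND ".join(terms)


query = all_of(
    '"Escherichia coli"[Organism]',
    "curated[filter]",
    "translation[COG Group]",
)

# Percent-encode the query so it can be used as a URL parameter value.
encoded = quote_plus(query)
```

Here `query` holds the same string as the example above, and `encoded` is its URL-safe form, with spaces turned into `+` and quotes and brackets percent-encoded.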

Query with Sequence

Currently, there are two ways to query protein clusters with a sequence: CD-Search and Concise Protein BLAST. The two methods draw on different resources and can be used for different purposes (Figure 8). Note: CD-Search accepts only protein sequence queries, whereas Concise Protein BLAST accepts both protein and nucleotide sequences.

Figure 8. Sequence similarity search against protein clusters. A representative chloroplast protein (YP_319743 ribosomal protein S12) was used to search protein clusters by sequence similarity. A. Curated prokaryotic protein clusters are available using the CD-Search ...


Position-specific scoring matrices (PSSMs) are constructed from the alignments of each of the curated clusters. The PSSMs for curated prokaryotic clusters have been added to the CD-Search page and can be searched using RPS-BLAST (Figure 8 - select PRK from the database pull-down menu). More information on how to use CD-Search can be found on the CDD help page. The CD-Search page only allows searches with protein sequence queries. There may also be a slight delay in the updates of information in CD-Search as compared to Entrez Protein Cluster.

Concise Protein BLAST

The Concise Protein BLAST database consists of all protein clusters (curated and non-curated), as well as nonclustered proteins (Figure 8). However, the protein clusters are not in their raw form. Instead, each cluster has been sliced at the level of genera to provide “subclusters”. From each subcluster, a single protein representative has been chosen (randomly) and is used in the BLAST database. A single curated cluster, therefore, may comprise many subclusters, each with a representative. This reduces the level of redundancy when using BLAST, resulting in speedier searches and providing a broader taxonomic view than is typically found in BLAST results. The other proteins in the cluster are automatically linked to this representative and will also be found in the search results, although without a BLAST score and E-value because they are not specifically examined. All proteins that do not belong to the genus-level clusters are also added to the database for completeness.
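
The slicing of a cluster into genus-level subclusters, each with one randomly chosen representative, can be sketched as follows. This is an illustration of the idea only; the input layout, the function name, and the optional `seed` parameter (added so the choice can be made repeatable) are assumptions.

```python
import random
from collections import defaultdict


def pick_representatives(cluster, seed=None):
    """cluster: list of (protein_id, genus) pairs.

    Returns genus -> (representative, other members linked to it).
    """
    rng = random.Random(seed)
    by_genus = defaultdict(list)
    for protein_id, genus in cluster:
        by_genus[genus].append(protein_id)

    result = {}
    for genus, members in by_genus.items():
        representative = rng.choice(members)  # random pick within the subcluster
        linked = [m for m in members if m != representative]
        result[genus] = (representative, linked)
    return result
```

Only the representatives would go into the search database; the linked members are reported alongside their representative without scores of their own.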

Query Page

Queries can be either protein or nucleotide using blastp and blastx programs, respectively. Accessions, GIs, or sequences in FASTA format can be entered in the query box.

Default parameters are set below the query box. The expect (E-value) threshold is set low, which will help reduce the number of BLAST results. Information about each parameter is available by clicking on each name.

Results Page

The results page is not the one typically returned for BLAST results, although a link is provided to view the results in standard format (Figure 8).

The query is shown, along with the length, the number of hits for total proteins, and the proteins represented by the genus-level clusters. A link to each cluster (if it exists) for either curated or non-curated clusters is also provided. Those proteins that do not have a cluster link are singletons that do not exist in clusters (and would not be found in the Entrez Protein Clusters site).

Results are returned in a collapsed table format. Genus-level clusters are represented with a plus (+) sign at each level, which can be expanded. The table is sortable by organism name and by BLAST score. Neither a BLAST score nor an E-value is given for the other proteins in a cluster, because only the one representative protein chosen from each genus-level cluster is searched when a query is submitted.


Help and information on BLAST are available from the main BLAST page.

Microbial genomes can also be searched using BLAST.

Release And Data Retrieval


Summed statistics for the current protein clusters release and date are available on this webpage:

Note that some of the links in the table are dynamic searches to Entrez databases that may generate different totals than those shown when the search is performed.

FTP Files

After every public protein cluster release (approximately every 3 months), all data derived from the clustering procedure, automated analyses, added information, and curation are released simultaneously in Entrez and on the FTP site at this location:

Cluster releases are currently by month and year, and subdivided by organism group (PRK, CHL, MTH, etc.). Current information about the directory, file structure, and statistics is in the README document for each release:

The files include flatfiles, alignments, and PSSMs for curated clusters, and link lists for protein GI to cluster, taxonomy, cluster ID to PubMed ID, etc. The Concise Protein BLAST database is also available. Large files are stored as tarballs (*.tgz files).


References

1. Klimke W., et al. The National Center for Biotechnology Information's Protein Clusters Database. Nucleic Acids Res. 2009;37(Database issue):D216–23. [PMC free article: PMC2686591] [PubMed: 18940865]
2. Bao Y., et al. The influenza virus resource at the National Center for Biotechnology Information. J Virol. 2008;82(2):596–601. [PMC free article: PMC2224563] [PubMed: 17942553]
Bookshelf ID: NBK3797

