Three major types of evidence are used by the Prokaryotic Genome Annotation Pipeline (PGAP) to assign names and other
attributes, such as gene symbols, publications, and EC numbers, to predicted proteins. They are Hidden Markov
Models (HMMs), BlastRules, and domain architectures.
These types of evidence are created based on the sequence similarity and structure of the protein family
they define. They are hierarchically organized according to their specificity and are assigned family
types, which depend on the diversity of the proteins in the family. For example, a broad-specificity HMM
of family type ‘domain’ typically hits a large number of proteins, usually with the same or similar domain
architectures but low overall sequence similarity. By contrast, an HMM of ‘subfamily’ type may hit fewer
proteins, with significant sequence similarity throughout their sequence and the same domain architecture.
Family types, and by extension the naming evidence, are assigned a precedence. If a protein is hit by
several pieces of evidence, it inherits the name and attributes from the evidence with the highest precedence. For
example, the RefSeq protein WP_000019730.1 is hit by three pieces of evidence:
the superfamily HMM TIGR01297.1 (product name: cation
diffusion facilitator family transporter), the BlastRuleException NBR007910 (product name: CDF family
zinc efflux transporter CzrB), and the domain architecture arch 11440813
(product name: cation transporter). It was named ‘CDF family zinc efflux
transporter CzrB’ based on BlastRule NBR007910, which has the highest precedence.
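The selection rule in this example can be sketched in a few lines of Python. This is a minimal illustration, not PGAP's implementation: the `Evidence` class and the `pick_name` function are hypothetical names, and the precedence scores (95, 60, and 33) are the ones listed for these family types further down this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    accession: str      # identifier of the evidence record
    family_type: str    # family type that determines the precedence
    product_name: str   # name the evidence would assign
    precedence: int     # precedence score of the family type


def pick_name(hits):
    """Return the product name of the hit with the highest precedence score."""
    best = max(hits, key=lambda e: e.precedence)
    return best.product_name


# The three pieces of evidence that hit WP_000019730.1.
hits_for_wp_000019730 = [
    Evidence("TIGR01297.1", "superfamily HMM",
             "cation diffusion facilitator family transporter", 33),
    Evidence("NBR007910", "BlastRuleException",
             "CDF family zinc efflux transporter CzrB", 95),
    Evidence("11440813", "domain architecture",
             "cation transporter", 60),
]

print(pick_name(hits_for_wp_000019730))
```

The sketch does not say how ties between equal precedence scores are resolved, because this page does not describe that.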
Hidden Markov Models (HMMs)
An HMM-based protein family is a probabilistic model used to determine which proteins belong or don’t belong
to the family. To construct HMMs, multiple sequence alignments (seed alignments) of proteins of known
function are converted into a
position-specific scoring system to generate an HMM profile. Amino acids at each position on the seed
alignment are given a score according to their frequency. Sequence and domain cutoffs are established based
on the seed alignments and used as minimum thresholds for query proteins to be classified as members of the
HMM's protein family (See
how to build an HMM).
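As a rough illustration of frequency-based, position-specific scoring, the sketch below turns a tiny seed alignment into per-column log-odds scores. It is a deliberate simplification and not how HMMER builds profiles: it ignores insertions, deletions, and sequence weighting, and the function names, the uniform background frequency, and the pseudocount are all assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def column_scores(seed_alignment, background=0.05, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    n_cols = len(seed_alignment[0])
    scores = []
    for col in range(n_cols):
        # Count how often each amino acid occurs in this column.
        counts = Counter(seq[col] for seq in seed_alignment if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        # Frequent residues get positive scores, rare ones negative scores.
        scores.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return scores


def score_sequence(seq, scores):
    """Sum the per-column scores for an ungapped sequence of the same length."""
    return sum(col[aa] for aa, col in zip(seq, scores))


seed = ["MKVL", "MKIL", "MRVL"]  # toy seed alignment, three sequences
profile = column_scores(seed)
print(score_sequence("MKVL", profile) > score_sequence("WWWW", profile))
```

A sequence that resembles the seed alignment scores higher than an unrelated one, which is the property that score cutoffs rely on.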
The HMMs used by PGAP come from a variety of sources. Some were built from scratch based on publications
documenting protein function (NCBIFAM); others were based on the NCBI protein clusters (PRKs). PGAP also uses TIGRFAMs (now owned by NCBI), either as built by TIGR or
modified. A subset of Pfam HMMs to which NCBI associated a protein
product name is also used. Most Pfam HMMs are built to describe domains found within proteins rather than
the proteins themselves. They lack a curated product name, are considered provisional, and are therefore not
used for functional annotation by PGAP.
Each HMM used in PGAP is assigned an NCBI accession (“NF” prefix, or "TIGR" prefix for TIGRFAMs) with a
version that is incremented if the seed alignments or cutoffs for the HMM are modified. For HMMs that
originated from an outside source, the source identifiers ("TIGR" and "PF" identifiers for TIGRFAMs and
Pfams, respectively) are also provided in the RefSeq protein and the evidence records. For these HMMs, the
product names, cutoffs, or even seed alignments may differ from the values assigned originally by the source.
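The accession-plus-version scheme can be illustrated with a small parser. This is only a sketch: the function name is hypothetical, and the pattern (an "NF" or "TIGR" prefix, digits, a dot, and an integer version) is inferred from identifiers shown on this page, such as TIGR01297.1, rather than from a formal specification.

```python
import re

# Assumed pattern: prefix, digits, ".", integer version.
ACCESSION_RE = re.compile(r"^(NF|TIGR)(\d+)\.(\d+)$")


def parse_hmm_accession(text):
    """Split a versioned HMM accession into accession, prefix, and version."""
    match = ACCESSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a versioned HMM accession: {text!r}")
    prefix, digits, version = match.groups()
    return {"accession": prefix + digits, "prefix": prefix, "version": int(version)}


print(parse_hmm_accession("TIGR01297.1"))
```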
In PGAP, predicted proteins are matched to HMMs using the hmmsearch program in the HMMER software (V3.2.1).
A protein is considered a hit and assigned the product name and other attributes from the HMM if its
sequence and domain scores are above the cutoffs defined for the HMM (See
how HMMs are used in protein annotation by PGAP). HMMs used in
PGAP for protein annotation are available on the NCBI’s ftp site.
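The hit criterion described above amounts to two threshold checks. The sketch below is illustrative only: the function and parameter names are hypothetical, the scores would in practice come from hmmsearch output, and treating a score exactly equal to the cutoff as a hit is an assumption.

```python
def is_hmm_hit(sequence_score, domain_score, sequence_cutoff, domain_cutoff):
    """Return True when both scores reach the cutoffs defined for the HMM."""
    return sequence_score >= sequence_cutoff and domain_score >= domain_cutoff


# Hypothetical scores for two proteins against an HMM with cutoffs of 250 and 200.
print(is_hmm_hit(312.4, 310.9, sequence_cutoff=250.0, domain_cutoff=200.0))
print(is_hmm_hit(312.4, 150.0, sequence_cutoff=250.0, domain_cutoff=200.0))
```

Only the first protein would receive the product name and other attributes of the HMM, because the second falls below the domain cutoff.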
BlastRules
BlastRules (identifiers starting with the “NBR” prefix) are a type of evidence for functional
classification of proteins based on BLAST (Basic
Local Alignment Search Tool). A BlastRule consists of one or more 'model' proteins with known biological
function, and BLAST identity and coverage cutoffs. Any protein aligning to a model protein above the cutoffs
is considered a BlastRule hit.
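A BlastRule check can be sketched as follows. The `Alignment` fields, the way identity and model coverage are computed from them, and the inclusive comparison are assumptions made for illustration. The 94% identity and 90% coverage values are the BlastRuleException cutoffs given later on this page.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    identical: int      # identical positions in the alignment
    length: int         # alignment length, including gaps
    model_start: int    # 1-based start of the alignment on the model protein
    model_end: int      # 1-based end of the alignment on the model protein
    model_length: int   # full length of the model protein


def percent_identity(aln):
    return 100.0 * aln.identical / aln.length


def model_coverage(aln):
    return 100.0 * (aln.model_end - aln.model_start + 1) / aln.model_length


def is_blastrule_hit(aln, identity_cutoff, coverage_cutoff):
    """Return True when the alignment meets both cutoffs of the rule."""
    return (percent_identity(aln) >= identity_cutoff
            and model_coverage(aln) >= coverage_cutoff)


# Hypothetical alignment of a predicted protein to a model protein.
aln = Alignment(identical=285, length=300, model_start=1, model_end=298, model_length=310)
print(is_blastrule_hit(aln, identity_cutoff=94.0, coverage_cutoff=90.0))
```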
BlastRules are typically created for proteins which may play significant roles in virulence, antibiotic
resistance, evolution, and pathogenicity, as documented in scientific journals. Curators review the
literature to determine whether the biological function of studied proteins is conclusive and informative
enough for creating a BlastRule. The protein sequences cited in articles are retrieved from the database and
used as queries for BLAST searches in a database of proteins with known function. The identity and coverage
cutoffs of the BlastRule are determined based on the BLAST results, as well as phylogenetic analyses of the
During the PGAP annotation process, predicted proteins are searched against a collection of BlastRules using
the NCBI BLAST tool. A protein is considered as a BlastRule hit and assigned the product name and other
attributes from the BlastRule if its sequence exceeds the sequence identity and coverage cutoffs of the
BlastRule. BlastRules used in PGAP for protein annotation are available on the NCBI’s ftp site.
Domain architectures
Proteins can be classified and grouped into evolutionarily conserved families based on their domain
architecture, the nature and order of conserved domain signatures identified along the sequence. Very often
such domain architectures are associated with a specific function. The Conserved Domain Database (CDD) curation
staff maintains a comprehensive collection of common protein domain architectures, derived from pre-computed
annotation of proteins with domain footprints. Architectures with significant coverage are reviewed and
given names, with an emphasis on architectures prominent in bacteria. The Subfamily Protein Architecture
Labeling Engine (SPARCLE) is used by PGAP for the functional
characterization and naming of protein sequences that have been grouped by their characteristic
conserved domain architecture. Names derived from domain architectures are sometimes rather generic,
as domain architectures may encompass a variety of specific functions and/or functionally
uncharacterized proteins. Protein domain architectures and related information retrieval services are
maintained by the CDD/SPARCLE team at NCBI. Detailed information is available on the NCBI CDD web site.
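The idea of grouping proteins by the nature and order of their domains can be shown with a toy example. This is not how SPARCLE is implemented: the function names, protein identifiers, and domain names are hypothetical, and each domain hit is reduced to a start position and a name.

```python
from collections import defaultdict


def architecture_key(domain_hits):
    """Order (start, name) domain hits along the sequence and keep the names."""
    return tuple(name for _, name in sorted(domain_hits))


def group_by_architecture(proteins):
    """Map each domain architecture to the proteins that share it."""
    groups = defaultdict(list)
    for protein_id, hits in proteins.items():
        groups[architecture_key(hits)].append(protein_id)
    return dict(groups)


proteins = {
    "protA": [(10, "domX"), (150, "domY")],
    "protB": [(140, "domY"), (5, "domX")],   # same order once sorted by start
    "protC": [(12, "domY"), (160, "domX")],  # same domains, different order
}
print(group_by_architecture(proteins))
```

Here protA and protB share one architecture, while protC forms its own group because its domains occur in the opposite order.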
Family types and order of precedence of the naming evidence
If a protein is hit by several pieces of evidence, it inherits the name and attributes from the evidence with the
highest-precedence family type. The various family types used in the evidence hierarchy and used for naming
proteins are defined below, from the highest to the lowest precedence:
BlastRuleIS (Transposase BlastRule)
BlastRuleIS was originally designed for transposases on insertion sequence (IS) elements with
nomenclature from ISFinder. However, the cutoffs for a
BlastRuleIS are stricter at their default levels (99% sequence identity) than the protein percent
identity cutoff suggested by ISFinder (98%) (precedence score = 96).
BlastRuleException (Exception BlastRule)
A BlastRuleException is used to annotate a special group of proteins that have a more specific
function within a protein family, such as listerolysin O, one of many named cholesterol-dependent
cytolysins. The identity and model protein coverage cutoffs of a BlastRuleException are set to 94% and
90%, respectively (precedence score = 95).
Exception HMM
An exception HMM recognizes proteins that share a specific chemical function, plus at least
one additional distinguishing feature such as having an extended region or belonging to a named
subclade. Examples of exception HMMs include specifically named isozymes that are expressed only for
pathways or biological processes (precedence score = 77).
Equivalog HMM
An equivalog HMM recognizes groups of proteins that are homologous, and similar in domain
architecture, and consistent enough in their specific function that all can receive the same
functionally descriptive name. Equivalog proteins are presumed to have descended from a shared
ancestral protein that had the same function. If the member proteins of an equivalog are enzymes,
they should share the same EC number (precedence score = 70).
Hypothetical equivalog HMM
Hypothetical equivalog HMMs are treated the same as Equivalog HMMs. Members of this HMM family are
expected to have the same specific function, but it may not yet be known what the function is, and
member proteins consequently may be assigned rather vague-sounding names (precedence score = 70).
Equivalog domain HMM
Equivalog domain HMMs are treated the same as Equivalog HMMs. The region hit by the HMM is
considered sufficient for assigning member proteins a specific functional name, but domain
architecture is known to be variable among the proteins within the family (precedence score = 70).
Hypothetical equivalog domain HMM
Hypothetical equivalog domain HMMs are treated the same as Equivalog HMMs. The function is presumed
to be consistent for members of the family, but may not yet be known. Domain architecture may be
variable across the family, but the region described by the HMM belongs to a conserved core whose
presence is considered sufficient for naming member proteins (precedence score = 70).
BlastRuleEquivalog (Equivalog BlastRule)
An equivalog BlastRule resembles an equivalog HMM in design and purpose, but it receives a
slightly lower precedence score than that of an equivalog HMM. The percent identity cutoff of
BlastRuleEquivalog is set to 80% upon creation by default, and then may be adjusted (precedence
score = 69).
BlastRuleSubPlus (Subfamily-plus BlastRule)
This type of BlastRule (subfamily-plus) enforces nearly full-length alignment between a model
protein from the rule’s definition and the candidate protein that it matches. Rules of this type
provide names to rather narrowly defined protein subfamilies; “plus” means rules of this sort
out-rank both CDD domain architectures and subfamily HMMs (precedence score = 65).
Domain architecture
There are two types of conserved domain architectures: superfamily and subfamily architectures.
Superfamily architectures consist solely of conserved domain superfamilies. This implies a general
functional category for proteins which have that architecture. Subfamily architectures either
contain a mix of conserved domain superfamilies and subfamilies or consist solely of conserved
domain subfamilies. Currently, only subfamily domain architectures are used for PGAP annotation of
proteins (precedence score = 60).
PfamEq (Pfam equivalog HMM)
Some Pfam HMMs hit exclusively proteins with a single named function, as HMMs built to find
equivalogs do. However, such models in Pfam tend to have permissive enough gathering thresholds that
additional proteins with only distant homology to the main cohort of proteins may score well enough
to be included, despite differing in function. Users should be wary of functional assignments made by
such HMMs when match scores, though above cutoff, are unusually low for the family (precedence score
Subfamily HMM
A subfamily HMM hits collections of proteins that typically show nearly full-length homology, and
may share a general function (e.g. NAD-dependent oxidoreductase), but often vary in specific
function (precedence score = 55).
BlastRuleSubMinus (Subfamily-Minus BlastRule)
This infrequently used type of BlastRule enforces nearly full-length alignment between a model
protein from the rule’s definition and the candidate protein that it matches, but is assigned a low
precedence in annotation, as if the name it applies is not very specific
(precedence score = 50).
BlastRuleCOLLAB (Collaboration BlastRule)
BlastRuleCOLLABs are designed for rapid import of large numbers of BlastRules supplied by trusted
external contributors. Once entered into our evidence system, BlastRuleCOLLABs can be subjected to
additional testing and then promotion to different BlastRule types that have higher precedence
(precedence score = 41).
Computational analysis has suggested that this HMM, from the Pfam collection, behaves in certain
ways like an equivalog HMM, but the standard warning applies that Pfam HMMs typically have permissive
cutoffs set to help identify all homologs, rather than stringent cutoffs designed to exclude
homologs differing in function from the family that the model describes (precedence score = 37).
Paralog HMM
A few older TIGRFAMs models were built to describe protein families that were abundant in at least
one narrow lineage, while rare or previously never seen outside that lineage. Proteins in these families
tend to be similar in length and align almost from end-to-end, and may contain recognizable homology
domains shared with proteins outside the family. The paralog HMM is thus a special case of the
subfamily HMM (precedence score = 35).
Superfamily HMM
Superfamily HMMs hit collections of proteins that typically show nearly full-length homology, and
that in addition may be able to detect essentially all homologs, rather than just one clade from
such a collection of proteins. A superfamily HMM can encompass several different subfamilies
(precedence score = 33).
Domain HMM
A domain is a localized region of sequence homology that is shared across proteins from different
families, whose other regions may be completely unrelated. Because HMMs that detect homology domains
find proteins that have a variety of different functions, and may describe only a small fraction of
proteins, domain HMMs name proteins in a fairly general way (e.g. NF023550) (precedence score =
Repeat HMM
Compared to domain HMMs, repeat HMMs tend to describe even smaller regions, usually as multiple regions
arranged in tandem. A single repeat unit may be too small to fold independently. The small size of
the repeat region, the correspondingly low cutoff scores necessitated by the small size, and the risk of
false-positive sequence matches give repeat HMMs a very low precedence not yet used in PGAP/RefSeq