Diversity, classification and evolution of CRISPR-Cas systems
Abstract
The bacterial and archaeal CRISPR-Cas systems of adaptive immunity show remarkable diversity of protein composition, effector complex structure, genome locus architecture and mechanisms of adaptation, pre-CRISPR (cr)RNA processing and interference. The CRISPR-Cas systems belong to two classes, with multi-subunit effector complexes in Class 1 and single-protein effector modules in Class 2. Concerted genomic and experimental efforts on comprehensive characterization of Class 2 CRISPR-Cas systems led to the identification of two new types and several subtypes. The newly characterized type VI systems are the first among the CRISPR-Cas variants to exclusively target RNA. Unexpectedly, in some of the Class 2 systems, the effector protein is also responsible for pre-crRNA processing. Comparative analysis of the effector complexes indicates that Class 2 systems evolved from mobile genetic elements on multiple, independent occasions.
Introduction
Thanks to the unprecedented utility of the engineered Cas9 endonucleases harnessed as genome editing tools, the comparative genomics, structures, biochemical activities and biological functions of CRISPR-Cas systems and individual Cas proteins have become subjects of intensive study [1–4]. CRISPR-Cas is a bona fide adaptive (acquired) immunity system with immune memory (Lamarckian-type inheritance) that is stored in the form of spacer sequences derived from foreign genomes and inserted into CRISPR arrays. CRISPR-Cas is a programmable form of immunity that can adapt to target any sequence and therefore is not under pressure to evolve an immense diversity of specificities, as is the case for restriction-modification enzymes, the most abundant form of innate immunity in prokaryotes [5]. Nevertheless, like any defense system, CRISPR-Cas is engaged in an incessant arms race with viruses, which results in rapid evolution of cas genes [6] and notable diversity of the gene repertoires and architectures of the CRISPR-cas loci, translating into diversification of the actual defense mechanisms [7,8]. More specifically, the diversification of the CRISPR-Cas systems is likely to be partly driven by their competitive coevolution with virus-encoded dedicated anti-CRISPR proteins [9–12].
Despite this extensive diversification, comprehensive comparative analysis has revealed major unifying themes in the evolution of CRISPR-Cas. These common trends include multiple contributions of mobile genetic elements to the emergence of the CRISPR-Cas immunity and its distinct variants; extensive duplication of cas genes yielding functionally versatile effector complexes; and modular organization, with extensive recombination of the modules [4,8,13]. The two principal modules of the CRISPR-Cas systems consist of the suites of genes encoding proteins involved in adaptation (spacer acquisition) and effector functions, i.e. pre-crRNA processing, and target recognition and cleavage. The adaptation module is largely uniform across the diversity of the CRISPR-Cas systems and consists of the endonuclease Cas1 and the structural subunit Cas2 [14]. In contrast, the effector modules are highly variable between CRISPR-Cas types and subtypes. Various proteins involved in ancillary roles, such as regulation of the CRISPR response and other, still poorly characterized functions, can be assigned to a third, accessory module [4,8,13].
In this brief review, we discuss the recent advances in the characterization of the CRISPR-Cas diversity, which result in rapid expansion of the CRISPR-Cas classification, the distinct structural and functional features of the newly discovered CRISPR-Cas variants, and the emerging evolutionary insights.
Exploration of CRISPR diversity and the evolving classification of CRISPR-Cas systems
The fast evolution and variability of the CRISPR-Cas systems make their classification a daunting task. Given the absence of universal cas genes and the frequent modular recombination, a single classification criterion is impractical. Therefore, a multipronged approach to CRISPR-Cas classification has been adopted that takes into account the signature cas genes that are specific for individual types and subtypes of CRISPR-Cas, sequence similarity between multiple shared Cas proteins, the phylogeny of Cas1 (the best conserved Cas protein), the organization of the genes in the CRISPR-cas loci and the structure of the CRISPR arrays themselves [8,15]. The combined application of these criteria resulted in the currently adopted classification scheme that partitions the CRISPR-Cas systems into two distinct classes characterized by different principles of the effector module design (Figure 1).
(A) Class 1
(B) Class 2
The organization of the CRISPR-cas loci and domain architectures of the effector proteins as well as the (predicted) target (DNA or RNA, or both) are shown for each subtype. For subtype III-D, a locus with reverse transcriptase fused to cas1 is included; other reverse transcriptase-containing variants, from subtypes III-A and III-D, are not shown. The genes for Class 1 effector subunits are shaded. SS, small subunit; TM, predicted transmembrane segment.
The Class 1 systems include the most common and diversified type I, type III that is represented in numerous archaea but is less frequent in bacteria, as well as the rare type IV that includes rudimentary CRISPR-cas loci lacking the adaptation module (Figure 1a). The effector complexes of type I and type III CRISPR-Cas display elaborate architectures, with a backbone consisting of paralogous RAMPs (Repeat-Associated Mysterious Proteins), such as Cas7 and Cas5, containing the RRM (RNA Recognition Motif) fold and additional ‘large’ and ‘small’ subunits [16–23] (Figure 2). These effector complexes contain one Cas5 subunit and several Cas7 subunits. The complex accommodates the guide RNA that consists of the spacer and a portion of a repeat. The Cas5 subunit binds the 5′ handle of the crRNA and interacts with the large subunit (Cas8 in type I and Cas10 in type III). The small subunit is typically present in several copies and interacts with the crRNA backbone bound to Cas7. Notably, the length of the bound spacer correlates with the number of Cas7 subunits in the backbone of the complex [24–26]. Although the protein subunits of type I and type III effector complexes show little sequence similarity to each other, the presence of the homologous RAMPs in both types of complexes and the overall structural similarity revealed by cryo-electron microscopy (Figure 2) leave little doubt about the common origin of these complexes [8,15,19]. An additional RAMP, Cas6, is loosely associated with the effector complex and typically functions as the repeat-specific RNase in pre-crRNA processing [27,28].
The effector modules of Class 2 each consist of a single, large, multidomain protein, such that the respective CRISPR-cas loci have a much simpler and more uniform organization than those of Class 1 (Figure 1). At the time of the construction of the current formal CRISPR-Cas classification [8], Class 2 included three subtypes of the thoroughly characterized type II, with the effector endonuclease Cas9, the widely used genome editing enzyme, and type V, with the predicted effector protein Cpf1. Type V was included in the classification scheme for the first time and remained experimentally uncharacterized. The subsequent demonstration that Cpf1 was indeed an active RNA-guided endonuclease that, unlike Cas9, did not require the additional trans-activating CRISPR (tracr)RNA for target cleavage [29] raised interest in comprehensive characterization of the diversity of the Class 2 CRISPR-Cas systems by genome and metagenome mining.
To this end, a computational pipeline was constructed that i) identified CRISPR-cas loci by using either the highly conserved cas1 gene or the CRISPR array as an anchor, ii) discarded those loci that could be classified into one of the known subtypes using the previously developed automatic approach, and iii) selected those that encoded a large protein, a putative Class 2 effector, for detailed, case-by-case analysis [30]. Running this pipeline against all available bacterial and archaeal genomes along with the long metagenomic contigs resulted in the identification of 3 new subtypes of type V, which all contain a RuvC-like endonuclease domain, and 3 subtypes of type VI, each with two HEPN domains predicted to possess RNase activity (Figure 1b) [31]. The interference activities of the type V-B (Cas12b) [30], type VI-A (Cas13a) [32] and type VI-B (Cas13b) [33] effectors have been validated experimentally.
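The three filtering steps described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the published pipeline: the `Gene` and `Locus` records, the `known_subtype` field standing in for the output of the automatic subtype classifier, and the 500-residue cutoff for a "large" protein are all assumptions introduced here (the text does not state a size threshold).

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cutoff: the text says only "a large protein".
MIN_EFFECTOR_LENGTH_AA = 500


@dataclass
class Gene:
    name: str        # gene label, e.g. "cas1" or an unannotated ORF id
    length_aa: int   # length of the encoded protein in amino acids


@dataclass
class Locus:
    locus_id: str
    genes: List[Gene]
    has_crispr_array: bool
    # Assumed to be filled in by an upstream subtype classifier; None if unclassified.
    known_subtype: Optional[str] = None


def has_anchor(locus: Locus) -> bool:
    """Step i: the locus contains cas1 or a CRISPR array."""
    return locus.has_crispr_array or any(g.name == "cas1" for g in locus.genes)


def large_proteins(locus: Locus, min_len: int) -> List[Gene]:
    """Step iii helper: genes other than cas1 that encode a large protein."""
    return [g for g in locus.genes if g.name != "cas1" and g.length_aa >= min_len]


def select_candidates(loci: List[Locus],
                      min_len: int = MIN_EFFECTOR_LENGTH_AA) -> List[Locus]:
    """Return loci that pass all three filters, for manual case-by-case analysis."""
    selected = []
    for locus in loci:
        if not has_anchor(locus):
            continue                      # step i: no anchor found
        if locus.known_subtype is not None:
            continue                      # step ii: already classified
        if large_proteins(locus, min_len):
            selected.append(locus)        # step iii: putative Class 2 effector
    return selected
```

In this sketch, the loci returned by `select_candidates` correspond to the set that would be passed on to detailed manual analysis; the sensitivity of the whole procedure is bounded by how reliably the anchor is detected in step i.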
The design of the computational pipeline for Class 2 ensured effectively complete detection of all variants existing in the searched databases, up to the sensitivity of the anchor (cas1 gene or CRISPR array) detection. Indeed, all CRISPR-Cas loci that did not belong to the already identified subtypes but contained putative Class 2 effectors were analyzed in detail and classified accordingly [31]. However, new genomes, especially those representing microbial groups not sampled previously, have the potential to yield new variants. Two such variants have been discovered in a recent search of genomes from uncultivated bacteria and can be classified as new type V subtypes [34]. The predicted effectors of these new Class 2 variants contain the RuvC-like domain but no other detectable nuclease domains. They were originally denoted CasX and CasY but, according to the Cas protein nomenclature, should become Cas12e and Cas12d, respectively, the latter being related to the previously identified Cas12c (Figure 1).
Type II and type V effectors: functional convergence vs structural diversity
Comparative-genomic discovery of the new Class 2 CRISPR-Cas systems has been rapidly followed by structural and experimental characterization of their predicted effectors (Table 1). The unifying feature of all type II and type V subtypes is the presence of a RuvC-like endonuclease domain that adopts the RNase H fold (Figures 1 and 3). However, the other parts of the Cas9 and Cas12 (the new designation for the type V effectors) proteins from different subtypes show no similarity to one another at the sequence or structure levels. Moreover, the similarity between the RuvC-like domain sequences of different subtypes is so low that they can be recognized as homologs only through the comparison of the conserved catalytic motifs and structures (Figure 3) [30,31]. The structures of several Cas9 proteins [35–38] as well as structures of Cas12a (Cpf1) from three bacteria [39–42] and two structures of Cas12b (C2c1) [43,44] complexed with the guide RNA and target DNA have been solved. Comparison of these structures reveals a remarkable interplay of convergence and diversity. All three effectors are of closely similar size and share similar bilobed, ‘jaw-like’ shapes (Figure 3), but the structures could not be superimposed apart from the RuvC-like domains [45]. The nuclease activity of the RuvC-like domain of Cas9 is complemented by that of the unrelated HNH family nuclease that is inserted into the RuvC-like domain (Figure 3) and cleaves the target (complementary to the guide) strand of the substrate DNA [46,47]. Structural analysis of the R-loop-containing complex of CRISPR-Cas9 with the target DNA has shown that Cas9 induces the structural distortion of the DNA helix that is required for R-loop formation, which then causes a conformational change in Cas9, such that the RuvC and HNH catalytic sites are positioned in proximity to the displaced and target strands, respectively [48].
(A) Schematic representation of the complexes of effector proteins with the target DNA or RNA, guide RNA and (for type II) tracrRNA
(B) Domain architectures of the effector proteins and their transposon-encoded evolutionary ancestors
IscB and TnpB are the inferred ancestors of the type II (Cas9) and type V (Cas12) effectors, respectively. The catalytic residues of the effector nuclease domain and, for Cas12a and Cas13a, the residues shown to be required for pre-crRNA processing are indicated in red. The Protein Data Bank (PDB) codes are included for proteins with solved structures. HTH, helix-turn-helix DNA-binding domain. The tracrRNA, the pre-crRNA processing catalytic sites and the nicking, target strand-cleaving nuclease are denoted by asterisks to indicate that they are each present only in subsets of the type II and type V effectors (see text for details). The catalytic amino acid residues of the target-cleaving nucleases are shown in red, and those of the pre-crRNA processing nuclease are shown in blue. RuvCI, II and III stand for the three conserved motifs of the RuvC-like nuclease domain that contain the catalytic residues. In the motif signature, “x” stands for any amino acid, and “..” indicates that the catalytic residues are separated by a small, variable number of non-conserved residues.
Table 1
Properties of experimentally characterized Class 2 CRISPR-Cas systems
| Type/Subtype/Effector | Nuclease domains | Target | Cut structure | tracrRNA requirement | PAM/PFS |
|---|---|---|---|---|---|
| II/Cas9 | RuvC, HNH | dsDNA | blunt | Yes | 3′ GC-rich PAM |
| V-A/Cas12a(Cpf1) | RuvC, NUC(?); uncharacterized domain for pre-crRNA processing | dsDNA | Staggered, 5′-overhangs | No | 5′ AT-rich PAM |
| V-B/Cas12b(C2c1) | RuvC | dsDNA | Staggered, 5′-overhangs | Yes | 5′ AT-rich PAM |
| VI-A/Cas13a(C2c2) | 2xHEPN; uncharacterized domain for pre-crRNA processing | ssRNA | Guide-dependent RNA cuts + collateral RNA cleavage | No | 3′PFS: non-G |
| VI-B/Cas13b | 2xHEPN | ssRNA | Guide-dependent RNA cuts + collateral RNA cleavage | No | 5′PFS: non-C; 3′PFS: NAN/NNA |
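For readers who want to query these properties programmatically, Table 1 can be transcribed into a small lookup structure. The sketch below is only a convenience encoding of the rows above: the `EffectorProperties` record and the `effectors_for` helper are names invented for this illustration, and the uncertain entry for Cas12a is kept verbatim as "NUC(?)".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EffectorProperties:
    subtype: str
    effector: str
    nuclease_domains: Tuple[str, ...]
    target: str               # "dsDNA" or "ssRNA", as listed in Table 1
    requires_tracrrna: bool


# Transcribed from Table 1; no values are added beyond what the table lists.
TABLE_1 = (
    EffectorProperties("II", "Cas9", ("RuvC", "HNH"), "dsDNA", True),
    EffectorProperties("V-A", "Cas12a", ("RuvC", "NUC(?)"), "dsDNA", False),
    EffectorProperties("V-B", "Cas12b", ("RuvC",), "dsDNA", True),
    EffectorProperties("VI-A", "Cas13a", ("HEPN", "HEPN"), "ssRNA", False),
    EffectorProperties("VI-B", "Cas13b", ("HEPN", "HEPN"), "ssRNA", False),
)


def effectors_for(target: str,
                  requires_tracrrna: Optional[bool] = None) -> List[str]:
    """List effectors with the given target, optionally filtered by tracrRNA need."""
    return [
        row.effector
        for row in TABLE_1
        if row.target == target
        and (requires_tracrrna is None or row.requires_tracrrna == requires_tracrrna)
    ]
```

For example, `effectors_for("dsDNA", requires_tracrrna=False)` picks out the only DNA-targeting effector in the table that works without tracrRNA.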
The Cas12a protein lacks the HNH domain but contains a distinct domain with a unique fold inserted in a similar but not identical position within the RuvC-like domain (Figure 3) [45]. Mutation of the catalytic residues of the RuvC-like domain in Cas12a of the bacterium Acidaminococcus sp. completely prevented the cleavage of both strands, whereas mutation of some of the conserved residues in the insert domain specifically and completely inhibited cleavage of the target strand [40]. These findings and the spatial proximity of this domain to the target strand suggested that the insert domain of Cas12a (denoted the Nuc domain) is responsible for the target strand cleavage [40].
The RuvC-like domain of Cas12b (C2c1) also contains an insert domain with a unique fold that has been denoted Nuc by analogy with Cas12a and because some mutations in this domain substantially reduced the target strand cleavage efficiency [43]. However, it has been shown that, in the case of Cas12b, both target and non-target DNA strands are accommodated and cleaved by the catalytic site of the RuvC-like domain, which undergoes a major conformational change triggered by the initial, non-target strand cleavage [44]. In a recent analysis of the structure and activity of Cas12a from the bacterium Francisella novicida, mutations in the Nuc domain only partially inactivated the target strand cleavage, leading to the suggestion that, analogously to Cas12b, Cas12a cleaves both strands of the target DNA via the RuvC-like domain, whereas the Nuc domain only contributes to the guide and target binding [42]. The details of the target recognition and cleavage by different type V effectors and the contributions of different domains remain to be thoroughly investigated. Nevertheless, the available data clearly indicate that, in type II and type V effectors of different subtypes, similar functional solutions have evolved via distinct routes.
Type VI: dedicated RNA-targeting CRISPR-Cas systems with toxin properties
In all type VI effectors (Cas13), the only recognizable features are two HEPN domains that, to an even greater extent than the RuvC-like domains in Cas9 and Cas12, show extreme sequence divergence such that their identification required extensive manual comparison [30,33]. These observations are in line with the previous analyses showing that the HEPN superfamily consists of extremely diverged RNases that are involved mostly in various defense-related functions in both prokaryotes and eukaryotes [49]. In particular, the HEPN RNase is the toxin domain of numerous bacterial and archaeal toxin-antitoxin modules. The discovery of the putative effectors containing HEPN domains prompted the prediction that type VI includes dedicated RNA-targeting CRISPR-Cas systems [30] (RNA targeting has been demonstrated for certain type III systems, but these target DNA as well [50–52]). This prediction has been validated for both type VI-A and type VI-B effectors (Cas13a and Cas13b, respectively), which have been characterized as programmable, RNA-guided RNases that efficiently protect Escherichia coli from infection by the RNA bacteriophage MS2 [32,33]. Analysis of the Cas13a structure confirmed the presence of the two HEPN domains and revealed yet another bilobed structure that generally resembles those of the type II and type V effectors (Figure 3) [53].
In the case of type VI-B, a remarkable feature, so far not observed for other CRISPR-Cas systems, has been discovered: the activity of Cas13b proteins is either up- or down-regulated by accessory small Cas proteins encoded by the respective type VI-B variants [33] (Figure 1b).
Strikingly, once activated by the cognate target RNA, Cas13a and Cas13b become promiscuous RNases that cleave RNA in a non-sequence-specific manner [32,33]. Furthermore, expression of the Cas13 proteins jointly with the guide and target RNAs leads to cellular toxicity, suggestive of induction of dormancy or programmed cell death in bacteria [32]. This phenomenon seems to be compatible with the previously proposed hypothesis on coupling between immunity and dormancy induction and/or programmed cell death (PCD) in prokaryotes [54,55]. Given that RNA viruses have a narrow host range and make up but a tiny part of the prokaryotic virosphere, it seems most likely that type VI systems target DNA viruses by cleaving viral transcripts and then inducing dormancy or PCD in response to intense expression of viral genes.
Pre-crRNA processing: multitasking Class 2 effectors
Most of the experimentally studied CRISPR arrays are expressed as a single, long transcript, known as pre-crRNA, which is subsequently processed into mature crRNAs via mechanisms that differ between types and subtypes [27]. Pre-crRNA processing in type II CRISPR-Cas systems has been characterized in detail. In this case, pre-crRNA cleavage is catalyzed by RNase III, a ubiquitous bacterial enzyme that is encoded outside the CRISPR-cas locus, and requires tracrRNA, a distinct RNA species that is partially complementary to the CRISPR repeat [27]. An alternative mechanism in some type II systems involves independent transcription of individual crRNAs [56]. Neither type V-A nor type VI systems require tracrRNA, and it has been shown that the corresponding effectors, Cas12a and Cas13a, are themselves fully responsible for pre-crRNA processing [53,57–59]. The catalytic sites involved in crRNA maturation have been at least partially identified and shown to be distinct from those involved in target cleavage but could not be associated with any distinct domains [53,57,58] (Figure 3). The structures of Cas12a and Cas13a show no similarity, apart from the bilobed shape (Figure 3), implying that the pre-crRNA processing activities have evolved independently in types V and VI. Type V-B systems require tracrRNA for crRNA maturation and target cleavage [30,43,44], but the mechanism of processing has not been established. It is thus a question of major interest whether pre-crRNA processing in type V-B involves RNase III, is catalyzed by Cas12b itself, or involves some other mechanism.
Origin and evolution of CRISPR-Cas: the pivotal contributions of mobile genetic elements
Over the last few years, comparative genomic analysis has shed light on the origins and evolution of CRISPR-Cas systems. Examination of the genomic context of cas1 homologs that are not associated with CRISPR-cas loci led to the discovery of a novel superfamily of self-synthesizing transposons, the Casposons, named after the Cas1 homolog that they encode and are predicted to employ as an integrase (recombinase) [60,61]. The integrase activity of the Casposon-encoded Cas1 (renamed casposase) has been validated experimentally [62], and similar target site specificities of Casposon integration and CRISPR spacer incorporation have been demonstrated [63]. Although the currently identified Casposons do not encode Cas2, some encode Cas4 and additional nucleases [61]. It seems likely that the entire adaptation module and perhaps even additional Cas proteins were contributed to the emerging CRISPR-Cas system by the ancestral Casposon [64]. Furthermore, the prototype CRISPR repeats and the leader sequence could have originated from a duplicated target site of the Casposon [65].
The ancestry of the effector module is far less clear. Given the widespread occurrence of Class 1 CRISPR-Cas systems among archaea and bacteria, contrasted with the much lower abundance of the Class 2 systems, the multisubunit effector complexes of Class 1 are the likely ancestral form [30]. Given that the core subunits of Class 1 effector complexes contain diverged versions of the RRM, it appears likely that the effector module evolved by serial duplication of an ancestral RRM protein, with subsequent diversification driven by the host-parasite arms race [13]. The ultimate ancestor of the core Cas proteins could have been an RRM domain-containing nuclease, such as Cas10, that gave rise to the extant multitude of active and inactivated paralogs. The ancestral effector complex, prior to the recruitment of the Casposon, could have functioned as an RNA-guided innate immunity system [64], perhaps similarly to the extant Argonaute-centered immunity [66].
The provenance of the Class 2 effector module is much clearer. Surprisingly, the sequences of the RuvC-like domains of Cas9 and Cas12 are more similar to the sequences of transposon-encoded TnpB proteins than they are to each other. The tnpB genes are highly abundant in bacterial and archaeal genomes and belong to autonomous but, more frequently, non-autonomous transposons. The role of TnpB in the transposon life cycle remains unclear given that this protein is not required for transposition [67], but the conservation of the RuvC-like endonuclease catalytic sites in most TnpB sequences indicates that these proteins are active nucleases. Remarkably, Cas9 and Cas12a, 12b and 12c showed the highest similarity to different groups of TnpB proteins, suggesting independent origins for the effectors of different types and subtypes of Class 2. The ancestry of Cas9 could be readily traced to a distinct family of transposons (hence denoted ISC, after Insertion Sequences Cas9-related) thanks to the shared domain architectures of the IscB and Cas9 proteins (Figure 3), in which an HNH endonuclease domain is inserted into the RuvC-like domain [68,69]. The specific ancestry of Cas12a–c remained murky due to low sequence conservation [30].
The line of descent of type V effectors from TnpB was clarified with the identification of subtype V-U (after Uncharacterized) loci using CRISPR arrays as the anchor for the computational Class 2 discovery pipeline [31]. Unlike the other Class 2 effectors, which are giant proteins with limited similarity to TnpB within the RuvC-like domains, type V-U loci encode TnpB homologs that are either the same size as transposon-encoded TnpB or slightly larger (Figure 1), and show highly significant similarity to the latter. The high similarity of the putative V-U effectors to TnpB enabled reliable phylogenetic analysis, which supported independent origin of Class 2 effectors from different TnpB subfamilies. The functionality of the type V-U systems remains to be demonstrated but at least some of them are likely to be active because the respective CRISPR arrays contain spacers homologous to phage genome sequences. The current evolutionary scenario for type II and type V effectors includes multiple insertions of non-autonomous TnpB-encoding transposons next to CRISPR arrays, followed by convergent ‘domestication’ of TnpB involving acquisition of additional domains [31]. These domains are unrelated in different subtypes, but in each case, enable the effector proteins to accommodate the crRNA guide complexed with the target DNA (Figure 3).
As with the type V effectors, the exact ancestry of the type VI, HEPN-containing effector proteins is difficult to pinpoint, but HEPN domain-containing Cas proteins of Class 1 CRISPR-Cas systems that target viral transcripts, such as Csm6 and Csx1, appear to be plausible candidates [31,70]. Ultimately, there is little doubt that these Cas proteins evolved from HEPN-containing RNase toxins. Thus, the evolutionary history of Class 2 systems is largely a story of the second major contribution, after that of the Casposons, of mobile elements to the evolution of the CRISPR-Cas adaptive immunity. Of all the apparently independent cases of recruitment of nucleases from mobile elements that gave rise to Class 2 systems, the capture of IscB that yielded type II was by far the most successful one: taken together, all other variants in Class 2 are about an order of magnitude less abundant among bacteria than type II [31]. Whether the type II systems provide some distinct advantage to bacteria or are for some reason more prone to horizontal gene transfer than other Class 2 systems is an intriguing question for future work on CRISPR biology.
Conclusions
The recent progress in the study of the diversity of CRISPR-Cas systems has led to the discovery of not only novel effector protein and locus architectures but also fundamentally new molecular mechanisms. These include the exclusive RNA targeting by type VI systems that appear to directly couple CRISPR immunity with dormancy induction or PCD; pre-crRNA processing by the effector proteins of types V-A and VI-A; differential regulation of type VI-B effectors by small Cas proteins; and the use of RNA for adaptation via CRISPR-associated reverse transcriptases by some type III systems [71]. There is little doubt that more functional diversity remains to be discovered, in particular with regard to the mechanisms of regulation of CRISPR-Cas activity and coupling of immunity with dormancy induction. Within only about 2 years, focused study of Class 2 systems has revealed a broad spectrum of effector nuclease activities that offer ample choice of functionalities for genome editing and regulation tools (Table 1). Furthermore, the parallel discoveries of Casposons and the type V-U loci suggest the possibility of experimental investigation of different stages in CRISPR-Cas evolution that can shed light on the transition from mobile genetic elements to an adaptive immunity system.