Copyright © The Author(s) 2009

A clonotype nomenclature for T cell receptors

1 Molecular Genetics Laboratory, Blood Research Institute, BloodCenter of Wisconsin, Milwaukee, WI 53226 USA
2 Department of Pathology, University of Massachusetts Medical School, Worcester, MA 01655 USA
3 Department of Public Health and Family Medicine, Tufts University School of Medicine, Boston, MA 02111 USA

Maryam B. Yassai, Phone: +1-414-9373824, Fax: +1-414-9376284, Email: maryam.yassai/at/bcw.edu. Corresponding author.

Received May 20, 2009; Accepted June 15, 2009.

Abstract

T cell receptor (TCR) nucleotide sequences are often generated during analyses of T cell responses to pathogens or autoantigens. The most important region of the TCR is the third complementarity-determining region (CDR3) whose nucleotide sequence is unique to each T cell clone. The CDR3 interacts with the peptide and thus is important for recognizing pathogen or autoantigen epitopes. While conventions exist for identifying the various TCR chains, there is a lack of a concise nomenclature that would identify both the amino acid translation and nucleotide sequence of the CDR3. This deficiency makes the comparison of published TCR genetic and proteomic information difficult. To enhance information sharing among different databases and to facilitate computational assessment of clonotypic T cell repertoires, we propose a clonotype nomenclature. The rules for generating a clonotype identifier are simple and easy to follow, and have a built-in error-checking system. The identifier includes the V and J region, the CDR3 length as well as its human or mouse origin. The framework of this naming system could also be expanded to the B cell receptor.

Electronic supplementary material

The online version of this article (doi:10.1007/s00251-009-0383-x) contains supplementary material, which is available to authorized users.
Keywords: TCR, Nomenclature, CDR3, Clonotype

Introduction

A hallmark of immunity is the intrinsic ability to recognize and eliminate foreign molecules, cells, and organisms. The adaptive immune system is composed of B and T cells. During T and B cell development these cells express unique heterodimeric receptors that can be used in pathogen recognition. Each of these receptor chains is generated by a somatic rearrangement process that joins different segments of the TCR and BCR genes and creates a novel gene. This joining process is imprecise, with insertion of non-templated nucleotides (N nucleotides) at the junction site as well as 3′- and 5′-nucleotide deletion from the germline genes participating in the rearrangement. This region of random nucleotide insertion or deletion is referred to as the third complementarity-determining region (CDR3). The resulting CDR3 has a unique nucleotide sequence that is specific to that particular B or T cell and all its progeny; hence, the clonotypic nature of the receptors. The CDR3 is the portion of these receptors that is most involved in interactions with intact soluble antigens (B cells) or intracellular processed antigens presented as immunogenic peptides loaded in MHC molecules (T cells).

The initial phase of the adaptive immune response involves B and T cell clonal selection on the basis of the structural complementarity of antigen-specific receptors to pathogen-derived epitopes (Davis and Chien 2003; Kolar and Capra 2003). The cells recruited into the immune response execute their effector functions. After pathogen clearance, a proportion of these cells will be retained as memory. Memory provides more rapid and effective immune protection against recurring pathogens present in the environment. The collection of cells that respond to a particular pathogen is referred to as the repertoire. T and B cells can also be implicated in responses to non-pathogenic environmental stimuli (allergies).
More serious is the lack of tolerance to self that results in responses to self-antigens, giving rise to autoimmune disease. In each case, a repertoire of allergen- or self-specific B or T cells is generated. The repertoire recognizing a molecule would be the sum of the repertoires responding against all the component epitopes of the molecule. The repertoire against an organism would be the sum of all the repertoires against all the molecules from the pathogen.

Measuring an immune response at the level of the repertoire is becoming very common (Correia-Neves et al. 2001; La Gruta et al. 2008; Naumov et al. 1998; Pewe et al. 2004; Probert et al. 2007; Venturi et al. 2008). An antigen-specific response can be viewed in the context of how many T cells are recruited and the structure of their antigen receptors. The nature of the naïve and antigen-experienced cell repertoires is of interest in basic and clinical immunology, immune-pharmaceutics, and vaccine development. However, comparison of datasets from similar, or even identical, experiments from different laboratories is cumbersome due to the lack of a unified clonal identification procedure in which the clonotypic antigen receptor serves as a marker of clonal identity. Having a quick way to assign specific identifiers to specific receptor sequences would facilitate such comparison studies.

There are two subsets of T cells based on the exact pair of receptor chains expressed. These are either the alpha (α) and beta (β) chain pair, or the gamma (γ) and delta (δ) chain pair, identifying the αβ or γδ T cell subsets, respectively. The expression of the β and δ chains is limited to one chain in each of their respective subsets, and this is referred to as allelic exclusion (Bluthmann et al. 1988; Uematsu et al. 1988). These two chains are also characterized by the use of an additional DNA segment, referred to as the diversity (D) region, during the rearrangement process.
The D region is flanked by N nucleotides, which together constitute the NDN region of the CDR3 in these two chains. The CDR3 of each of the two receptor chains defines the clonal specificity. For αβ T cells the CDR3 is in most contact with the peptide bound to the MHC (Rudolph et al. 2006). For this reason, CDR3 sequences have been the main focus of sequencing studies.

In the past three decades, TCR clone sequences have been presented in publications in many different forms. Some use an alias as an identifier and present the whole nucleotide sequence of a clone, identifying the V, D, and J segments (Elliott et al. 1988). In some publications, the information about the V and J usage and the amino acids of the V/NDN/J junction sequences (Kent et al. 2005) are given, while in other publications, both the nucleotide and amino acid sequences of all the different segments that have been recombined to make up the CDR3 region of the TCR clones are given (Maslanka et al. 1996; Naumov et al. 1998; Shin et al. 2005). However, a full sequence can be quite bulky. Often, for simplicity, each sequence is assigned its own alias, which may be a number or a combination of letters and numbers, to ease the tracking of information (Cameron et al. 2002; Chien et al. 1987; Correia-Neves et al. 2001; Davis and Bjorkman 1988; Elliott et al. 1988; Kalams et al. 1994; La Gruta et al. 2008; Lehner et al. 1995; McHeyzer-Williams and Davis 1995; Naumov et al. 1998, 2006; Pewe et al. 2004; Venturi et al. 2008). With the arrival of new ultra-high throughput or massively parallel sequencing techniques these data sets are bound to grow larger. Without proper standardization, the general compilation of such information across published and documented data sources is problematic. Thus, there is a need for a nomenclature that allows TCR chains to be properly enumerated and traced to the T cell clones.
The primary purpose of this naming system is to have a unique identifier for the CDR3 of each TCR chain, so that information about the T cell clones in publications, databases, and other forms of communication can be unambiguously associated with the correct T cell clone. The proposed nomenclature is intended to provide the immunology community with an easy route to share genetic information about clonal and clonotypic T cell receptors.

Materials and methods

T cell clonotypes

To properly document and enumerate TCR CDR3, we have developed a working definition of a clonotype and a nomenclature that reflects the sequence information of the CDR3 of that particular receptor:
The clonotype nomenclature

The clonotype nomenclature refers to a system of names that are fully controlled through explicit and rigid syntactic rules. We have also defined a list of desired features for a clonotype identifier that allows computational assignment. To minimize the identifier length and to maintain clarity, a mix of letters and digits is used. There is a firm restriction on the use of capitalization and character formatting in a formal name. For the clonotype nomenclature, lowercase letters are reserved for amino acid sequences in the V and J regions and uppercase letters are reserved for amino acid sequences in the region between the V and J. Thus, the uppercase corresponds to amino acids encoded by the N or NDN regions. To make names fully computable, we avoid the use of subscripts, superscripts, accents, and word separators; Greek symbols are replaced by uppercase Roman letters; the period (‘.’) is used as a symbol separator.

The clonotype-naming process

The name contains information on the amino acid sequence originating from the V, J, and NDN regions. The name consists of five segments: (1) CDR3 amino acid identifier, (2) CDR3 nucleotide sequence identifier, (3) variable (V) segment identifier, (4) joining (J) segment identifier, and (5) CDR3 length identifier. The name can be constructed and deconstructed in the same manner. Access to a standard genetic code table and the germline configuration of the V and J regions identified in the name is all that is needed to reconstruct the actual nucleotide sequence of the clonotype. The rules for clonotype naming are as follows:
(1) CDR3 amino acid identifier. This segment uses the one-letter amino acid code and always starts and ends with a lowercase letter. The starting lowercase letter represents the last amino acid of the V segment that is completely (all three nucleotides) encoded by the V region. The final lowercase letter represents the first amino acid entirely encoded by the J region. Uppercase letters represent amino acids that are encoded fully or in part by the NDN region.
(2) CDR3 nucleotide sequence identifier. A series of digits (ID), with a leading period as a symbol separator, is reserved for the nucleotide identifier. Each digit reflects the specific codon used for the corresponding uppercase amino acid in the name. The number of digits is not limited, and the digits appear in the same order as their amino acid counterparts. The identifier assignment is based on the standard codon table (Table 1). The codons for each amino acid are numbered sequentially from top to bottom and then across and down for the six-codon amino acids. The three termination codons are assigned the letter “O” and numbered 1 for Ochre (TAA), 2 for Amber (TAG), and 3 for Opal (TGA). The letter “O” is chosen because two of the three terminators start with this letter and no amino acid is associated with it.
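The numbering rule can be sketched in a few lines of Python. Table 1 is not reproduced in this text, so the sketch assumes the codon table is laid out in the conventional T, C, A, G order for all three positions; the names `CODON_NUMBERS` and `encode_ndn` are hypothetical and are not part of the nomenclature. Under that assumption the termination codons come out as O1 (TAA), O2 (TAG), and O3 (TGA), as stated in the rule.

```python
from itertools import product

BASES = "TCAG"  # assumed layout of the standard codon table (Table 1 is not shown here)

# Standard genetic code, one letter per codon, enumerated with the first,
# second, then third base each running through T, C, A, G; '*' marks a stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def build_codon_numbers():
    """Map each codon to (letter, number); termination codons use the letter 'O'."""
    table = {}
    seen = {}
    for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
        letter = "O" if aa == "*" else aa
        seen[letter] = seen.get(letter, 0) + 1  # running count per amino acid
        table["".join(bases)] = (letter, seen[letter])
    return table


CODON_NUMBERS = build_codon_numbers()


def encode_ndn(codons):
    """Return (uppercase amino acids, digit string) for a list of NDN codons."""
    pairs = [CODON_NUMBERS[codon.upper()] for codon in codons]
    letters = "".join(letter for letter, _ in pairs)
    digits = "".join(str(number) for _, number in pairs)
    return letters, digits
```

Because no amino acid has more than six codons, each codon maps to a single digit, which is what keeps the amino acids and the ID digits in one-to-one correspondence.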
(3) V segment identifier. The V gene family (also referred to as group) is identified by an uppercase Roman letter followed by a specific subfamily (also referred to as subgroup) identifier. In order to sort the clonotypes based on V gene usage, we assign a fixed number of characters to the V gene subfamilies. The names of the human V gene subfamilies are as originally described by Hood and colleagues (Rowen et al. 1996). Each V gene is assigned a subfamily number (two digits) followed by S and another number to define the subfamily member. In the case of TCR AV and TCR BV genes, some subfamilies have more than one member. The members are identified by S1, S2, S3, … for human and −1, −2, −3, … for mouse. The names of the mouse V genes are based on the ImMunoGeneTics (IMGT) database (Giudicelli et al. 2005). For mouse distal V alpha genes that are repeats of the proximal ones, we omitted the “–” in the name to keep the total number of characters at five, as for the human V genes. The breakdown of the assigned characters is shown in Table 2.
The identification of a subfamily member from a TCR sequence focused on the CDR3 depends on the specificity of the V region primer and the sequence homology of the subfamily members in the DNA segment 3′ of the V primer. If the primer is specific enough to distinguish a specific subfamily member, the clonotype name will carry that subfamily member’s name. If the primer anneals to a region in which all subfamily members have identical sequences, then the DNA sequence 3′ of the primer will determine the TCRV name. If all subfamily members also have an identical sequence in this region, the subfamily member’s name will be SX for human and −X for mouse. If some family members can be defined but others cannot, the indistinguishable subfamilies are referred to using Y and Z. The possible members that comprise Y and Z should be further explained. Some AV genes can rearrange to either alpha J genes (resulting in a TCR alpha chain) or delta J genes (resulting in a TCR delta chain). These are called ADV genes. Based on their location in the AV locus region, we simplify the nomenclature by using the alpha gene name. The J region identifier then specifies to which constant region the VA is linked. Shown in Table 3 are the human and mouse alpha/delta genes and our corresponding nomenclature. There are two exceptions to this rule: the human delta V1 gene, which is located between the AV23 and AV24 genes, rearranges only to the delta J genes and has not yet been found rearranging to alpha J genes; and the mouse AV15-2/DV6-2 and AV15D-2/DV6-2 genes are similar and have not yet been found rearranging to the alpha J genes. In our naming system, the delta V name will be used for these genes: D1 for human and D6-2 for mouse.
Allelic forms of V regions exist (http://imgt.cines.fr/textes/IMGTrepertoire/Proteins/#B). Currently, the clonotype nomenclature does not account for these. They could be identified by enlarging the V region identifier by one or two characters. The need for this level of characterization is unclear at this time, so the identifier has been kept shorter for the sake of usability.
(4) J segment identifier. The J gene identifier appears after the V gene identifier. The J gene family is expressed by Roman letters as defined for the V gene identifier above. Human (Rowen et al. 1996) and mouse (Giudicelli et al. 2005) J genes are named as described. For the sake of brevity and to facilitate sorting, the “S” designating subfamily members in human and the “−” designating subfamily members in mouse are dropped, resulting in a two-digit number. The details of the J character assignment are shown in Table 4. It should be noted that there are five subfamilies in the human gamma J family (1, 2, P, P1, and P2): two are identified by numbers (gamma J1 and gamma J2), one by a letter (gamma JP), and two by a letter and a number (gamma JP1 and gamma JP2). In order to have the same characters for all human gamma J genes, we assign a number to those that lack a number identifier, as follows: GJP = GJ3, GJP1 = GJ4, and GJP2 = GJ5.
There are a number of alleles of J regions that have been reported (Lefranc and Lefranc 2001 & http://imgt.cines.fr/textes/IMGTrepertoire/Proteins/#B). Currently, the nomenclature does not take these into account. If needed, the J identifier could be extended by one character to include an allele identifier.
(5) CDR3 length identifier. The length of the clonotype is determined by the number of amino acids between the C-terminal conserved cysteine (C) of the V region and the phenylalanine (F) of the J region, which is part of the FGXGT motif conserved in all J regions. The C and the F are not counted in the length. The number representing the length is preceded by the letter L, which serves as a symbol separator. In order to sort the clonotypes based on their length, we assigned three characters to the length: the letter “L” followed by two digits specifying the length.

Results and discussion

Generating a TCR clonotype identifier

The use of the nomenclature is demonstrated for a TCR β-chain clonotype from our studies of CD8 T cells from HLA-A2.1 individuals responding against the influenza A matrix protein M1-derived peptide, M1 58–66 (Fig. 1).
V region nomenclature

For TCRAV and TCRBV, some subfamilies have more than one member. The identification of the subfamily members depends on two factors. The first is the specificity of the V region primer that is used for amplifying the particular V subfamily member. Primers could be designed that are specific for only one subfamily member. If the primer is specific enough to anneal to only one of the V subfamily members, then the clonotype identifier will use the subfamily member’s name, such as S1, S2, S3 (for human) and −1, −2, −3 (for mouse). The second factor is the sequence homology between the subfamily members in the region 3′ of the V primer up to the conserved cysteine. Nucleotide differences downstream of the conserved cysteine are not considered because of the possibility of excision during the rearrangement process. For some choices of primer, there may be sufficient differences in the region between the primer and the conserved cysteine that the particular subfamily member can be identified. If this is the case, then the name of the subfamily member is used. In other cases, the sequence between the V primer and the conserved cysteine matches multiple subfamily members. We reserve the letter X for use if the primer does not allow any distinction between subfamily members. Y and Z can be used to designate subsets of possible subfamily members, and these must be defined. These designations will be specific to the primers used and, once defined, can be used repeatedly. An example of V gene identification is shown in Supplementary Table 1.

Identifying other chains

Additional examples of using the nomenclature for human α-TCR, β-TCR, γ-TCR, and δ-TCR are shown in Table 5. Since the δ-chain could be the result of either Vδ- or Vα-chain genes rearranging to Jδ, we show an example of the naming of both possibilities (examples 4 and 5).
Decoding TCR clonotype identifier

By decoding the name, the nucleotide sequence of the TCR chain can be derived by reversing the steps used for the encoding. Using the first example shown in Table 5, “rTs.4A38S2A53L13”, the “A38S2” and “A53” segments show that the clonotype is of human origin and that the sequences of the alpha V38S2 and alpha J53 genes are needed for the decoding (Fig. 2).
D regions

Our nomenclature does not define the D region of the clonotype. The D regions could be defined after decoding by a homology search. Because of the truncation of D regions, they are often difficult to assign unambiguously. Defining the D region usage would be left to the individual investigator.

Properties of the naming system

The nomenclature described here has a number of important properties:
Advantages of the naming system

By having these characteristics, the nomenclature has several general advantages. By combining all five elements of the CDR3 region, this system permits any clonotype to be defined. Our nomenclature is more compact than either a nucleotide- or an amino-acid-based naming system. It is two-thirds shorter than the CDR3 nucleotide sequence, while still describing the nucleotide sequence. The CDR3 amino acid sequence is pared down to the NDN contribution only. It distinguishes clonotypes that use different encodings for the same CDR3 amino acid sequence. For example, we have found 207 different clonotypes that use the same BV19S1 and the same BJ2S7 and have exactly the same CDR3 amino acid sequence (CASSIRSSYEQYF). Because the amino acid identifier is the same for all of them, they would be impossible to distinguish without the nucleotide identifier. We show some examples of this in Table 6 from our analysis of the HLA-A2-restricted response to influenza M1 58–66. This shows the power of the nomenclature for population studies that deal with a large number of similar clonotypes.
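To see why the amino acid sequence alone cannot identify a clonotype, one can count how many nucleotide sequences translate to the same stretch of amino acids. The following is a minimal sketch using the standard genetic code; the function name `count_encodings` is hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import prod

# Standard genetic code, one letter per codon in T, C, A, G order; '*' marks a stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Number of synonymous codons for each amino acid (for example 6 for S, 3 for I).
CODONS_PER_AMINO_ACID = Counter(AMINO_ACIDS)


def count_encodings(peptide):
    """Count the distinct nucleotide sequences that translate to `peptide`."""
    # An unrecognized letter has a count of 0, so the product becomes 0.
    return prod(CODONS_PER_AMINO_ACID[aa] for aa in peptide.upper())
```

For the four NDN-encoded residues IRSS this gives 3 × 6 × 6 × 6 = 648 possible encodings, each of which would receive a different nucleotide identifier.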
By being compartmental, the proposed nomenclature can enumerate all possible names. Each compartment is an identifier. While it is unlikely that new J or V regions will be uncovered in mice or man, these could be easily absorbed into the name. The compartmentalization allows the level of identification of the V region to be reflected in the name. If the identification of polymorphic variants of either V or J regions becomes important, the size of the compartment for these regions could be expanded to accommodate the addition. If the system were to be used for naming BCRs, an identifier for the heavy chain constant region could be added after the J identifier. The structure allows these identifiers to be fully computable, and the character assignment of the gene identifiers makes it easy to sort based on the V gene, the J gene, and the length. It also supports built-in error checking between the digits in the ID and the number of amino acids in the NDN region, which stand in a one-to-one relation for all functional TCR clones. For example, if there are four amino acids in the NDN region, there must be four digits in the ID part of the name, so errors are easily found.

Comparing TCR clonotypes

To the extent that clonotypes are public (1), they can be identified in many laboratories. Thus, a fixed nomenclature will avoid difficulties associated with local identifiers. For example, the M1 response in HLA-A2 individuals has been studied by many groups. We show that some of the published clonotypes identified by Moss et al. in 1991, by Lehner et al. in 1995, and by us (Naumov et al. 1998, 2006) were observed in more than one study (Table 7). This shows the power of a common robust naming system in comparing the results of related studies that have been published independently.
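Both the error check and the comparison of names across studies start from splitting an identifier into its five segments. The sketch below is one possible way to do this with a regular expression, not the authors' implementation. It assumes that the J and length identifiers each occupy three characters, as described in the rules, and it takes whatever lies between the ID digits and the J identifier as the V identifier. The names `NAME_PATTERN`, `parse_clonotype`, and `id_matches_ndn` are hypothetical.

```python
import re

# Segments: amino acids, '.', ID digits, V identifier, J identifier, length.
NAME_PATTERN = re.compile(
    r"^(?P<aa>[a-z][A-Z]*[a-z])"  # lowercase V residue, uppercase NDN residues, lowercase J residue
    r"\.(?P<id>\d*)"              # one digit per uppercase (NDN) amino acid
    r"(?P<v>[A-Z].*)"             # V identifier, e.g. A38S2 (assumed to start with the family letter)
    r"(?P<j>[A-Z]\d{2})"          # J identifier, e.g. A53 (family letter plus two digits)
    r"L(?P<length>\d{2})$"        # length identifier, e.g. L13
)


def parse_clonotype(name):
    """Split a clonotype identifier into its five segments."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a well-formed clonotype identifier: {name!r}")
    fields = match.groupdict()
    fields["length"] = int(fields["length"])
    return fields


def id_matches_ndn(fields):
    """Built-in error check: one ID digit for each uppercase amino acid."""
    ndn = fields["aa"][1:-1]  # strip the flanking lowercase V and J residues
    return len(ndn) == len(fields["id"])
```

Applied to the Table 5 example quoted in the text, `parse_clonotype("rTs.4A38S2A53L13")` yields the amino acid segment `rTs`, the ID `4`, the V identifier `A38S2`, the J identifier `A53`, and a length of 13, and the check passes because one uppercase residue is matched by one digit.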
Alternative codon numbering systems

We also examined an alternative approach for codon numbering. We used the same codon numbering table as described above, but then generated a list of all the possible ways of encoding a particular NDN sequence. The observed sequence is then defined by its index position in the list. The IRSS amino acid sequence can be used as an example: when the clonotype identifier encodes I1, R1, S1, S1, the ID number would be 1 instead of 1111. When the clonotype identifier encodes I1, R1, S1, S2, the ID number would be 2 instead of 1112. Since IRSS has 648 possible encoding combinations (3 × 6 × 6 × 6), the identifiers would be shorter, using one to three characters. However, the disadvantage of this approach is that it is less direct and requires a computer program for optimal implementation. This takes away the ability of an individual to manually identify or decode a particular sequence.

Summary

We have implemented a rational nomenclature system that makes TCR sequences easier to read and compare. The nomenclature rules are simple and easy to implement. With the codon table available, it would be easy to name any TCR clonotype or clone without developing customized naming software. Nevertheless, the rules are simple enough to be encoded in computer programs. The system has a built-in error check, which is the one-to-one correspondence between each digit in the ID and each uppercase amino acid in the clonotype identifier. The benefits of our consistent nomenclature would accrue exponentially as the number of TCRs under study increases.

Implementing this nomenclature would facilitate deployment of clonotype databases run by individual laboratories for specific immune responses and immune diseases. The clonotype names within these databases would be reliable, error-free, and allow easy cross-referencing and comparison of T cell repertoires by different laboratories.
The clonotypes could be cataloged in a single database and annotated as to their occurrence and associations with particular responses. Such a catalog is only possible by providing an easy-to-use nomenclature. T cell clones can be unambiguously identified by naming both chains. The same identification could be provided for single-cell PCR data where both chain sequences are available. The framework of this naming system could also be implemented for B cell receptors if a system was added to account for somatic hypermutation. Such a convention would open up the possibility of creating a BCR catalog which would be a useful tool for investigators working on BCR repertoires. Below is the link to the electronic supplementary material.
Supplementary Table 1 (2.0M, pdf) Examples of human TCRBV subfamilies that have more than one member: sequence homology and their assigned name (PDF 2137 kb)

Acknowledgments

We thank Dr. Marie-Paule Lefranc for clarification of the mouse TCR nomenclature. We also thank Dr. Andrea Ferrante for helpful discussions. This work was funded by the National Institutes of Health Grant U19 AI062627.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.