Copyright © 2002, Cold Spring Harbor Laboratory Press

The Role of Lineage-Specific Gene Family Expansion in the Evolution of Eukaryotes

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA

¹Corresponding author.

Received February 8, 2002; Accepted May 8, 2002.

Abstract

A computational procedure was developed for systematic detection of lineage-specific expansions (LSEs) of protein families in sequenced genomes and applied to obtain a census of LSEs in five eukaryotic species, the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe, the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and the green plant Arabidopsis thaliana. A significant fraction of the proteins encoded in each of these genomes, up to 80% in A. thaliana, belong to LSEs. Many paralogous gene families in each of the analyzed species are almost entirely comprised of LSEs, indicating that their diversification occurred after the divergence of the major lineages of the eukaryotic crown group. The LSEs show readily discernible patterns of protein functions. The functional categories most prone to LSE are structural proteins, enzymes involved in an organism's response to pathogens and environmental stress, and various components of signaling pathways responsible for specificity, including ubiquitin ligase E3 subunits and transcription factors. The functions of several previously uncharacterized, vastly expanded protein families were predicted through in-depth protein sequence analysis, for example, small-molecule kinases and methylases that are expanded independently in the fly and in the nematode. The functions of several other major LSEs remain mysterious; these protein families are attractive targets for experimental discovery of novel, lineage-specific functions in eukaryotes.
LSEs seem to be one of the principal means of adaptation and one of the most important sources of organizational and regulatory diversity in crown-group eukaryotes.

[Supplemental material is available online at ftp://ncbi.nlm.nih.gov/pub/aravind/expansions and http://www.genome.org.]

The eukaryotic crown group (the unresolved assemblage of lineages in the eukaryotic tree, which includes plants, animals, fungi, and some protists, as opposed to early branching eukaryotes, which are all unicellular protists), although only representing the proverbial tip of the eukaryotic phylogenetic iceberg, encompasses a remarkable variety of organisms (Patterson 1999; Dacks and Doolittle 2001). This diversity is apparent in both morphological and biochemical features of the crown group that spans the entire range from unicellular yeasts and chlorophytes, through facultatively multicellular slime molds, to genuine multicellular organisms, plants, animals, and fungi (Sogin et al. 1996; Patterson 1999). The complete, or nearly complete, genome sequences from three major branches of the crown group, plants, animals, and fungi, are starting to provide the first molecular explanations for both their unity and diversity. From one viewpoint, the crown-group eukaryotes are remarkably uniform in that they share a large set of conserved orthologs in the core components of their essential functional systems, such as those involved in DNA replication and repair, most aspects of RNA metabolism, cytoskeletal organization, protein degradation, and secretion (Chervitz et al. 1998; Rubin et al. 2000; Lander et al. 2001). Furthermore, components of the signal transduction pathways, structural and regulatory components of the nucleus, and pre-mRNA processing complexes, although showing clear differences between the major crown-group lineages, are largely constructed from the same set of protein domains, and are based on the same architectural principles (Chervitz et al. 
1998; Aravind and Subramanian 1999; Rubin et al. 2000; Lander et al. 2001). This unity notwithstanding, preliminary comparative studies on the sequenced eukaryotic genomes also provided clues as to what evolutionary phenomena might underlie their diversity. At the level of the protein sets encoded in the crown-group genomes, the main contributing forces appear to be the emergence of new domain architectures through domain accretion and domain shuffling, lineage-specific gene loss, and lineage-specific expansion of protein families (Aravind and Subramanian 1999; Aravind et al. 2000; Rubin et al. 2000; Lander et al. 2001). Lineage-specific expansion (LSE) is defined in relative terms, as the proliferation of a protein family in a particular lineage, relative to the sister lineage, with which it is compared (Jordan et al. 2001). Thus, if two sister lineages, for example, Drosophila and Caenorhabditis representing insects and nematodes, respectively, are compared, all protein-family proliferation events (duplications to n-plications) that occurred in either of these lineages after their separation are considered LSEs. Preliminary analysis of proteins from the crown-group eukaryotic genomes revealed some tangible correlations between LSE and emergence of new biological functions, response to diverse environmental pressures, and organizational complexity. Some of the most striking cases of LSE are related to pathogen and stress response and include, among other families, expansions of the immunoglobulin superfamily associated with the vertebrate immune system, AP-ATPases involved in plant disease resistance (Hulbert et al. 2001), and the cytochrome P450 family, which participates in detoxification systems in both plants and animals (Nelson 1999; Tijet et al. 2001). 
Transcription factors represent another functional category of proteins that tend to show widespread LSE: the independent expansions of the POZ–C2H2 and C4DM–C2H2 fusions in insects, the nuclear hormone receptors in nematodes, and the KRAB-domain-fused Zn-fingers in vertebrates apparently made substantial contributions to the evolution of developmental and differentiation features specific to each of these lineages (Sluder et al. 1999; Aravind et al. 2000; Riechmann et al. 2000; Coulson et al. 2001; Lander et al. 2001).

Despite a wealth of anecdotal information, we are unaware of a systematic comparative analysis of LSEs in eukaryotic genomes. With this objective, we devised a procedure to systematically detect LSEs. Having identified LSEs in five eukaryotic proteomes, those of Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, and Arabidopsis thaliana, we predicted, wherever feasible, the biochemical or biological functions of the lineage-specific clusters (LSCs) and explored their potential roles in the diversification of the crown group. Here, we present a systematic analysis of the demography of LSEs and provide evidence for a major involvement of LSEs in the generation of the diversity of biological functions in multicellular eukaryotes.

RESULTS AND DISCUSSION

Identification and Validation of Candidate Lineage-Specific Clusters

Using the clustering procedure described in the Methods section, we delineated candidate LSCs for five eukaryotic genomes. The automatically generated LSCs were further surveyed for false positives, that is, proteins that were unrelated to the rest of the proteins in the cluster, by using BLAST searches and multiple alignments. A subset of false positives arose from compositionally biased segments that escaped filtering during the automatic process.
The presence of some false positives was mainly due to one or more of the proteins in a cluster containing multiple domains or being artificially fused to another protein. The majority of such false positives were detected among A. thaliana proteins, in which gene prediction errors resulted in artificial fusions of distinct genes. On several occasions, these artificial gene fusions resulted in an erroneous merger of one or more distinct clusters; these were manually separated. Additionally, a few smaller clusters that belonged to a larger lineage-specific expansion were merged. On average, ~9% of the LSCs of size greater than two were subjected to manual corrections. The automatic procedure used for delineating candidate LSCs included single-linkage clustering of proteins by sequence similarity and an ultrametric tree construction using UPGMA (see Methods). These methods accurately reproduce phylogenetic relationships only under the strict molecular clock hypothesis. Therefore, to verify the phylogenetic coherence of the candidate clusters, 10 of the candidate LSCs from each analyzed species that consisted of 4 or more members and had homologs in other species were chosen for phylogenetic analysis. In each tree, the proteins from the candidate LSC grouped together and, in 48 of the 50 cases, this grouping was strongly supported by bootstrap analysis (>70%) to the exclusion of homologs from other species and paralogs from the same species that do not belong to the given LSC (Fig. 1).
The two ascomycete yeasts, S. pombe and S. cerevisiae, were the closest pair of sister lineages compared. The two animals, D. melanogaster and C. elegans, represented a slightly greater phylogenetic divergence relative to each other, whereas the plant A. thaliana represented an even deeper branch with respect to animals and fungi. Thus, the LSCs from each of these species enabled us to examine the role of LSEs in diversification of eukaryotes at different levels of evolutionary divergence.

Proteome-Wide Demography of Lineage-Specific Family Expansion

The detected LSEs encompassed between ~20% of the proteome (the yeasts) and ~80% (A. thaliana) (Fig. 2A). [...] $P(k) = ck^{-\gamma}$, in which $P(k)$ is the frequency of families with exactly $k$ members and $c$ and $\gamma$ are constants (Fig. 3).
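To illustrate how the constants c and γ in a relation of the form P(k) = c·k^(−γ) can be estimated from a list of family sizes, the following minimal sketch fits a straight line to log P(k) versus log k by ordinary least squares. This is an assumption made for illustration: the fitting method used in the study is not stated in this text, and the function name `fit_power_law` and its input format are hypothetical.

```python
import math


def fit_power_law(family_sizes):
    """Estimate (c, gamma) in P(k) = c * k**(-gamma).

    family_sizes: iterable of positive integers, one entry per family
    (hypothetical input format). The fit is an ordinary least-squares
    line through the points (log k, log P(k)); this is an illustrative
    choice, not necessarily the method used in the paper.
    """
    # Count how many families have exactly k members.
    counts = {}
    for k in family_sizes:
        counts[k] = counts.get(k, 0) + 1
    total = sum(counts.values())

    ks = sorted(counts)
    if len(ks) < 2:
        raise ValueError("need at least two distinct family sizes")

    xs = [math.log(k) for k in ks]
    ys = [math.log(counts[k] / total) for k in ks]

    # Closed-form simple linear regression: y = intercept + slope * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # log P = log c - gamma * log k, so c = exp(intercept), gamma = -slope.
    return math.exp(intercept), -slope
```

A log-log regression is the simplest way to read off the exponent, but it weights sparse large-family bins as heavily as well-populated small ones; a maximum-likelihood fit would be a reasonable alternative.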
To characterize the role of LSEs in the evolution of the respective classes of paralogous proteins in each lineage, we devised the expansion coefficient (EC), which is the ratio of the number of proteins in LSCs to the total membership of the given class of paralogs in a given proteome. The EC is a measure of the fraction of a given paralogous class that has evolved through LSE after the divergence of the given lineage from the closest sister lineage included in the analysis. LSCs with EC = 1 are those families that have been invented de novo and proliferated thereafter in a particular lineage. The relative abundance of LSCs in the EC range between 0 and 0.9 is roughly constant for all taxa considered here, with slightly >5% of the LSCs in each of the bins of size 0.1 in this range (Fig. 4).
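The definition of the expansion coefficient can be written out as a short sketch. The function names and the integer-based binning below are hypothetical choices for illustration; only the ratio itself and the bin width of 0.1 come from the text.

```python
def expansion_coefficient(proteins_in_lscs, class_size):
    """EC = (proteins of a paralogous class that fall in LSCs) / (class size)."""
    if class_size <= 0:
        raise ValueError("class_size must be positive")
    if not 0 <= proteins_in_lscs <= class_size:
        raise ValueError("proteins_in_lscs must lie between 0 and class_size")
    return proteins_in_lscs / class_size


def ec_bin(proteins_in_lscs, class_size):
    """Index of the 0.1-wide EC bin: 0 for [0, 0.1), ..., 9 for [0.9, 1), 10 for EC = 1.

    Integer arithmetic is used so that values such as 3/10 are not pushed
    into the wrong bin by floating-point rounding.
    """
    expansion_coefficient(proteins_in_lscs, class_size)  # reuse input validation
    return (10 * proteins_in_lscs) // class_size
```

For example, a class of 40 paralogs with all 40 members in LSCs has EC = 1 and lands in the separate top bin, which corresponds to the families described above as invented de novo in a lineage.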
Analysis of the top 25 LSEs with EC = 1 from all proteomes pooled together indicated that the majority of them are α-helical proteins or have conserved patterns of histidines and cysteines. Typical examples include the α-helical nonspecific lipid-transfer protein in plants, the C4DM domain in D. melanogaster that chelates cations through conserved cysteines, and the T20D4.15 family of disulfide-bonded secreted proteins from C. elegans. Thus, de novo emergence of protein domains that substantially contributed to LSEs appears to have involved primarily invention of structurally simple folds. These folds could have evolved through compaction of long α-helical coiled coils or through disulfide-bond- or metal-supported stabilization mediated by a few strategically placed, conserved cysteines and/or histidines. Invention of such simple domains could have been more expedient than emergence of complex α/β structures that require several specific stabilizing interactions to be fixed (Aravind and Koonin 2000).

Biological Significance of Lineage-Specific Expansions

The above observations show that, quantitatively, LSEs are a major component of the differences between the proteomes of various eukaryotic taxa. New paralogous families could provide the material for specific adaptations and for evolution of new functional systems. In qualitative terms, we sought to investigate the biological significance of LSEs by identifying conserved domains, subcellular localization signatures, such as signal peptides and transmembrane regions, and other features of proteins in LSCs that might allow prediction of their functions (when less than obvious). These identifications for the top five LSCs in each organism are shown in Table 1.
We categorized the LSCs into broad functional classes to discern global functional trends and also investigated individual LSCs in an attempt to gain a more detailed understanding of their actual biological roles (Table 2; Supplementary Material available online).
Although LSEs occurred in most biological functional classes, LSCs with predicted organism-specific functions, such as pathogen and stress response, transcription regulation, controlled protein degradation mediated by the ubiquitin system, protein modification, signal transduction, chemoreception, and small molecule metabolism were most abundant (Tables 1 and 2). A typical example of an expansion related to an organism-specific function is that of the C. elegans collagens, which are required for cuticle formation, a characteristic adaptation of the nematodes (Johnstone 2000). Similarly, in D. melanogaster and Arabidopsis, prominent LSEs are, respectively, the insect cuticular proteins (Andersen et al. 1995) and pectin/cellulose biosynthesis enzymes (Willats et al. 2001), both of which are critical for the formation of morphological features unique to these lineages. Typically, these proteins are required in large amounts as structural components of the respective organisms; hence, these lineage-specific expansions could principally help in increased production of these proteins. Extending this analogy, it is possible that several of the LSCs with no detectable homologs elsewhere could represent as yet uncharacterized, but abundant, lineage-specific structural proteins (Table 2). Many of the identified LSCs had predicted biochemical characteristics that pointed to roles in stress and pathogen response. Particularly striking in this category was the expansion of proteases of the pepsin-like and subtilisin-like families in A. thaliana, trypsin-like proteases in D. melanogaster, and Zn-metalloproteases in C. elegans (Table 2). 
All of these proteases are predicted to be secreted molecules, and their repeated, independent expansion suggests that they are widely utilized either for direct degradation of pathogen proteins or as components of stress-triggered proteolytic cascades broadly analogous to the vertebrate complement and clotting systems (Bouchard and Tracy 2001; Southan 2001). Alternatively, in the case of plants, they could aid in protein digestion in the process of germination. Better-understood cases of similar lineage-specific expansions related to stress/pathogen-response components include the massive proliferation of apoptotic (AP-) ATPases and the accompanying moderate expansion of metacaspases in plants, and the parallel expansion of caspases in vertebrates (Aravind et al. 2001; Holub 2001). These proteins are either known or predicted to participate in multiple pathways associated with apoptosis or hypersensitive response. In this context, also of interest are the expansions of molecules containing modules functioning in extracellular adhesion. Prominent examples of these include the C-type lectins (D. melanogaster, C. elegans), PR1 proteins (C. elegans, A. thaliana), CUB domain proteins (C. elegans), and the bulb-lectin domain (A. thaliana). As with the immunoglobulin domain proteins, which are highly expanded in vertebrates, these molecules probably participate in the recognition and binding of specific pathogens as a part of defense mechanisms of the corresponding organisms (Table 2). Earlier analysis of the LSEs involving transcription factors had suggested that they included proteins regulating critical aspects of the development of the organism (Aravind and Koonin 1999; Riechmann et al. 2000; Lander et al. 2001). For example, the proteins belonging to the POZ and SAZ-type Myb domain expansions in D. 
melanogaster (Table 2) regulate functions as diverse as maintenance of the antero-posterior Hox gene expression pattern, neurite outgrowth and pathfinding, and organogenesis (Aravind and Koonin 1999; Lander et al. 2001). Thus, it appears that proliferation of new transcription factor families, followed by their recruitment as upstream or downstream regulators with respect to core conserved developmental pathways, has contributed substantially to the evolution of morphological diversity in animals. The generality of this observation was reinforced by the evidence of massive, lineage-specific expansion and diversification of various transcription-factor families in the plant A. thaliana (Table 2). Many of these include well-characterized DNA-binding proteins, such as the MADS box and MYB domain proteins, that have been shown previously to participate in plant-specific functions, including development of flowers and other structures, meristemal differentiation, and organ-specific gene expression (Riechmann et al. 2000). In this study, we detected certain unexpected expansions of DNA-binding proteins in plants that might point to previously unrecognized transcription regulators. Examples include the proteins homologous to the mitochondrial transcription termination factor, which, in other eukaryotes, is present in a single copy that functions in the mitochondrion (Fernandez-Silva et al. 1997). The additional paralogs in plants have probably acquired different transcription-related functions because they form a tight cluster, distinct from the ancestral mitochondrial version. Plants also show an expansion of the DNA-binding replication factor A (RF-A), with >40 copies in A. thaliana, in contrast to the one to three copies observed in other eukaryotes. The expansion and divergence of RF-A in plants suggest that the plant-specific paralogs are probably utilized as transcription factors rather than in their usual capacity in replication (Wold 1997).
These and other such examples (Table 2) illustrate that transcription factors are recruited from a wide variety of pre-existing sources and diversify to occupy new functional niches via LSE. We observed a major role of LSE in the elaboration of the ubiquitin pathway, which is involved in the degradation and regulatory modifications of proteins (Hershko and Ciechanover 1998). Evidence of LSE was obtained for several components of the ubiquitin system, in particular, E3 subunits of ubiquitin ligases containing the F-box domain (Kipreos and Pagano 2000) (A. thaliana and C. elegans) and the RING-finger (A. thaliana). Because the E3 proteins are specificity determinants that are involved in targeting the conserved ubiquitin-ligation machinery system to specific substrates (Jackson et al. 2000), their diversification through LSE probably provides a means of harnessing an otherwise conserved system to regulate the degradation of diverse sets of targets. In a similar vein, both nematodes and plants also show independent LSEs of the MATH domain. This domain, which tends to form fusions to ubiquitin carboxy-terminal hydrolases or RING-finger E3s (Aravind et al. 1999; Polekhina et al. 2002), might serve as an additional adaptor that mediates de/ubiquitination of specific targets. A. thaliana has a prominent proliferation of the adenovirus-like thiol protease superfamily whose members (e.g., Smt4/Ulp1) in yeast and in vertebrates, remove ubiquitin-like proteins from their targets (Li and Hochstrasser 2000; Nishida et al. 2000). Thus, in plants, this LSC probably contributes to further diversification of the regulation of ubiquitin-dependent protein degradation. Targeting of proteins for degradation has been shown to occur through the recognition of hydroxyproline by ubiquitin ligase complexes (Ivan et al. 2001). Thus, the LSE of 2-oxoglutarate-dependent prolyl hydroxylases (Aravind and Koonin 2001) detected in D. melanogaster and A. 
thaliana could represent another case in which the range of the core ubiquitination pathway is expanded via diversification of the terminal effectors. The role of LSE in the diversification of the proximal components of signal transduction systems, the receptors, had been noticed previously in the cases of independent expansions of odorant receptors/7-transmembrane chemoreceptors seen in different animal lineages (Dryer 2000; Glusman et al. 2001) and plant receptor kinases containing extracellular leucine-rich repeats, bulb lectin, or EGF-like extracellular domains (Shiu and Bleecker 2001). Here, we detected other analogous expansions of upstream signaling proteins, such as potassium channels, innexin family channels (both in C. elegans), and tetraspanins and degenerin-type channels in D. melanogaster (similar LSEs of K-channels and tetraspanins are also seen in humans). The proteins involved in these expansions are linked to the organism's responses to external as well as internal homeostatic stimuli. Thus, such expansions could serve as the raw material for the behavioral and physiological adaptation of organisms to their specific environments. Lineage-specific expansions are also seen in a range of protein-modifying enzymes of different signal transduction cascades, such as protein kinase families in most lineages, SET-domain protein-methylases in D. melanogaster, and PP2C phosphatases in plants. As with the ubiquitin system, these appear to be a means of linking well-conserved stems of signaling pathways to distinct sets of terminal targets. Another aspect of the involvement of LSEs in the evolution of signal-transduction networks is the extensive proliferation of families of proteins containing adaptor domains. Along with their expansion, many adaptor domains have also recombined with a variety of other domains, probably allowing the emergence of new networks of interactions.
A striking example is the major expansion of proteins containing the small Ca-binding octicosapeptide (OOP) module (Ponting 1996) in A. thaliana. Some OOP modules are fused to VIV1-like plant-specific DNA-binding proteins and a specialized class of GAF domains, suggesting that they link transcription regulation and small molecule interactions to Ca-dependent signaling. Another notable case is a novel adaptor domain, typified by the amino-terminal domain of the Caspase-1A isoform, which so far was detected only in C. elegans. Altogether, the C. elegans genome encodes >40 members of this domain family, which, in addition to the caspase fusion, also form multidomain proteins with SET-domain methylases, PHD fingers, and kinases. Given the α-helical structure predicted for this domain, and enrichment in charged residues, it probably functions as a protein–protein interaction module. Another, somewhat unexpected generalization that emerged from the present analysis is the prevalence of small molecule-modifying enzymes among the LSEs. In plants, the proliferation of such enzymes, namely methylases of the caffeic acid O-methylase family, dioxygenases of the gibberellin-hydroxylase family, and a variety of lipases and acyltransferases, correlates with the plethora of secondary metabolites, such as pigments, volatile aromatic compounds, alkaloids, and waxes, that are produced by plants (Seigler 1998). However, their large numbers suggest that the entire diversity of metabolites produced even by plants such as A. thaliana with relatively simple genomes is under-appreciated to a large extent. Interestingly, animals also have several LSEs associated with small molecule metabolism. Some of these, such as glycosyltransferases and acyltransferases, suggest there might be an as yet unexplored, lineage-specific diversity of carbohydrates and lipid moieties that are associated with glycoproteins, lipoproteins, and other cellular metabolites.
The two independent expansions of predicted small-molecule kinases related to ethanolamine and aminoglycoside kinases (Hon et al. 1997) (in D. melanogaster and, to a lesser extent, in C. elegans) and the expansion of secreted methylases in C. elegans are particularly enigmatic. Given the role of the related bacterial kinases and methylases in xenobiotic resistance (Haggblom 1990), these enzymes might be used to modify a range of xenobiotics encountered by the animals in their specific environments. Alternatively, they could modify various environmental substances to convert them to forms more easily sensed by the chemoreceptors of these organisms.

Conclusions

A computational procedure for systematic detection of lineage-specific expansions of protein families was developed and applied to obtain a comprehensive census of LSEs in five eukaryotic genomes. LSEs appear to have played an important role in the growth and differentiation of the proteomes of multicellular eukaryotes. Many paralogous gene families in crown-group eukaryotes appear to have evolved almost entirely through LSE after the divergence of the examined sister lineages from their ancestors. This fundamental process of gene family expansion was active at a wide range of phylogenetic distances, from the relatively close species of yeasts to the much earlier separation of plants from the rest of the crown-group taxa. Generally, the fraction of proteins found in LSCs and the fraction of large families among LSCs positively correlate with the size of eukaryotic proteomes. Examination of the known and predicted functions of the detected LSEs reveals certain general principles. Genes encoding proteins typically required in large quantities as components of an organism's morphological structures are often subject to LSE and appear to be fixed versions of the common phenomenon of gene amplification, with fine-tuning added through sequence diversification (Kondrashov et al. 2002).
Another major set of LSCs consists of proteins involved in recognition and binding of pathogens and xenobiotics and withstanding environmental stress. These LSCs probably provide the raw material for generating the diversity required to counter rapidly changing pathogens and to respond to other variable environmental factors. Expansion followed by diversification of the proteins in the LSCs appears to be a common means of generating new specificities in signaling pathways. In particular, in the ubiquitin system, a large number of the E3 components of the ubiquitin ligase, which target it to specific proteins, are drawn from LSEs. Expansions of adaptor modules followed by their fusion to diverse domains probably result in the emergence of novel interactions that contribute to signaling and transcription regulation. Several expanded enzyme families also point to the existence of an, as yet, undiscovered diversity of small molecule metabolites in various lineages. Thus, LSE seems to be one of the most important sources of structural and regulatory diversity in crown-group eukaryotes, which was critical for the tremendous exploration of the morphospace seen in these organisms.

METHODS

The protein set for the nematode C. elegans was from the WormPep20 data set (http://www.sanger.ac.uk/Projects/C_elegans/wormpep); the protein sets for other analyzed eukaryotic species were extracted from the NCBI (NIH) nonredundant (nr) protein sequence database. The human protein set was not systematically analyzed because of extensive problems with gene predictions, resulting in fragmentary proteins, artificial fusions, and inclusion of pseudogene translations and translation of noncoding DNA. Identical or nearly identical (98% or greater) sequences were removed from the data sets using the BLASTCLUST program. For documentation on its use, see ftp://ftp.ncbi.nlm.nih.gov/blast/documents/README.bcl.
LSCs were identified using the following procedure: BLAST comparisons for all proteins in the analyzed set of complete eukaryotic genomes were run against the database consisting of the same set of proteins. Symmetrical relative similarity scores, $R_{AB} = R_{BA} = \max(S_{AB}/S_{AA}, S_{BA}/S_{BB})$, in which $S_{AB}$ is the BLAST bit score for query A and subject B, were recorded. Such scores range from 0 (no significant hit found) to 1 (identical proteins). For each protein A in a given genome X (e.g., C. elegans), a set of candidate family comembers {B} was defined as a set of proteins from the same genome X satisfying the condition ($R_{AB} > R_{AC}$ for $C \notin X$) (i.e., similarity between the given protein A and another C. elegans protein B is greater than that between A and any protein C from any other genome). Then, all such sets from X were merged if they shared at least one member (single-linkage clustering), resulting in grouping all proteins from X into clusters {A} (many of which might contain only a single protein). This procedure leads to heavy overclustering because, even if only one pair of proteins in two distinct LSCs passes the comembership condition (e.g., due to fluctuations in the observable similarity), the two LSCs are merged by the single-linkage algorithm. This over-inclusive set of clusters was refined through identification of the most closely related proteins from other genomes. For each $A \in \{A\}$, the best alien hit C was identified as $[C \mid \max(R_{AC});\ C \notin X]$. Sets $\{A\} \cup \{C\}$ (i.e., candidate LSC members and their closest alien relatives) were subject to UPGMA clustering on the basis of relative similarity scores. Under this procedure, proteins from other genomes that show high similarity to some candidate LSC members may intrude into the cluster and split it apart.
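The first stage of this procedure (relative similarity scores, the comembership condition, and single-linkage merging) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input layout (a dictionary of bit scores keyed by query and subject, including every self-score, and a protein-to-genome map) and the names `relative_score` and `candidate_clusters` are hypothetical, and the subsequent UPGMA refinement is not included.

```python
from collections import defaultdict


def relative_score(bits, a, b):
    """Symmetrical relative similarity R_AB = max(S_AB / S_AA, S_BA / S_BB).

    bits maps (query, subject) -> BLAST bit score; a missing pair counts as 0
    (no significant hit). Self-scores (a, a) and (b, b) are assumed present.
    """
    s_ab = bits.get((a, b), 0.0)
    s_ba = bits.get((b, a), 0.0)
    return max(s_ab / bits[(a, a)], s_ba / bits[(b, b)])


def candidate_clusters(bits, genome_of, target):
    """Single-linkage candidate clusters for the proteins of genome `target`.

    A protein B joins the comember set of A when both are in `target` and
    R_AB exceeds R_AC for every protein C outside `target`.
    """
    inside = [p for p in genome_of if genome_of[p] == target]
    outside = [p for p in genome_of if genome_of[p] != target]

    # Union-find forest: merging sets that share a member is single linkage.
    parent = {p: p for p in inside}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a in inside:
        best_alien = max(
            (relative_score(bits, a, c) for c in outside), default=0.0
        )
        for b in inside:
            if b != a and relative_score(bits, a, b) > best_alien:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in inside:
        groups[find(p)].add(p)
    return sorted(sorted(group) for group in groups.values())
```

Because a single qualifying pair is enough to join two groups, this step reproduces the overclustering noted in the text, which is why a refinement against the closest proteins from other genomes is needed afterwards.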
Subclusters {A′} satisfying $[A' \subset X]$ (i.e., UPGMA subtrees consisting of proteins exclusively from the currently analyzed genome X) and including more than one protein were considered to represent LSCs.

Protein sequence similarity searches were performed using the gapped BLASTP program against the nonredundant protein sequence database (NCBI, NIH). Iterative profile searches to detect more distant relationships were performed using the PSI-BLAST program (Altschul et al. 1997), with the inclusion threshold typically set at E = 0.01; only predicted globular regions from proteins were used as seeds for PSI-BLAST searches. Proteins were partitioned into probable globular and nonglobular regions using the SEG program (Wootton 1994). Conserved domains were detected using domain-specific PSSMs constructed using the PSI-BLAST program (Chervitz et al. 1998). Multiple alignments were constructed using the T_Coffee (Notredame et al. 2000) and ClustalX (Thompson et al. 1997) programs and corrected manually on the basis of PSI-BLAST search results, which, on some occasions, correctly detect conserved sequence motifs missed by multiple alignment methods. These alignments were used to construct Neighbor Joining phylogenetic trees (Saitou and Nei 1987) using the PAUP* (Swofford 1998) and PHYLIP (Felsenstein 1996) packages (the evolutionary distances were calculated using the PROTDIST program of PHYLIP), and the support for nodes of interest was evaluated by use of 1000 bootstrap replicates. Secondary structure of proteins was predicted using the PHD program, with multiple alignments used as input for prediction (Rost and Sander 1994). Signal peptides were predicted using the SignalP program (Nielsen et al. 1997).

The supplementary material available online at ftp://ncbi.nlm.nih.gov/pub/aravind/expansions and http://www.genome.org includes:
(1) Complete lists of proteins in the identified lineage-specific clusters from five eukaryotic species (Format: text files).
(2) The phylogenetic trees that were constructed to verify the ability of the above reported procedure to correctly detect lineage-specific expansions (Format: text file containing trees that can be visualized with the TreeView program; Roderic Page; URL: http://taxonomy.zoology.gla.ac.uk/rod/treeview.html).
(3) A detailed version of Table 2 with references for the entries wherever possible (Format: PDF).

WEB SITE REFERENCES

ftp://ftp.ncbi.nlm.nih.gov/blast/documents/README.bcl; Documentation for the BLASTCLUST program.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome; Source of the analyzed protein sequence set except for those of C. elegans.
http://www.sanger.ac.uk/Projects/C_elegans/wormpep; Wormpep database, the source of the C. elegans proteins.
ftp://ncbi.nlm.nih.gov/pub/aravind/expansions; Supplementary material.
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html; TreeView program for phylogenetic tree visualization.

Acknowledgments

We thank I. King Jordan and Kira Makarova for their help in developing the procedures for identifying the LSCs.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Footnotes

E-MAIL koonin/at/ncbi.nlm.nih.gov; FAX (301) 435-7794.

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.174302.

REFERENCES