Copyright: © 2009 Hu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Global Functional Atlas of Escherichia coli Encompassing Previously Uncharacterized Proteins

1 Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
2 Medical Research Council Laboratory of Molecular Biology, Cambridge, United Kingdom
3 Department of Biology, Wilfrid Laurier University, Waterloo, Ontario, Canada
4 Department of Biology and Ottawa Institute of Systems Biology, Carleton University, Ottawa, Canada
5 Department of Computer Science, Royal Holloway, University of London, Egham, United Kingdom

Andre Levchenko, Academic Editor, Johns Hopkins University, United States of America

#Contributed equally.
* To whom correspondence should be addressed. E-mail: gmoreno/at/wlu.ca (GM-H); E-mail: andrew.emili/at/utoronto.ca (AE)

Received October 21, 2008; Accepted March 16, 2009.

Abstract One-third of the 4,225 protein-coding genes of Escherichia coli K-12 remain functionally unannotated (orphans). Many map to distant clades such as Archaea, suggesting involvement in basic prokaryotic traits, whereas others appear restricted to E. coli, including pathogenic strains. To elucidate the orphans' biological roles, we performed an extensive proteomic survey using affinity-tagged E. coli strains and generated comprehensive genomic context inferences to derive a high-confidence compendium for virtually the entire proteome consisting of 5,993 putative physical interactions and 74,776 putative functional associations, most of which are novel. 
Clustering of the respective probabilistic networks revealed putative orphan membership in discrete multiprotein complexes and functional modules together with annotated gene products, whereas a machine-learning strategy based on network integration implicated the orphans in specific biological processes. We provide additional experimental evidence supporting orphan participation in protein synthesis, amino acid metabolism, biofilm formation, motility, and assembly of the bacterial cell envelope. This resource provides a “systems-wide” functional blueprint of a model microbe, with insights into the biological and evolutionary significance of previously uncharacterized proteins. Author Summary One goal of modern biology is to chart groups of proteins that act together to perform biological processes via direct and indirect interactions. Such groupings are sometimes called functional modules. The types of protein interactions within modules include physical interactions that generate protein complexes and biochemical associations that make up metabolic pathways. We have combined proteomic and bioinformatic tools, and used them to decipher a large number of protein interactions, complexes, and functional modules with high confidence. In addition, exploring the topology of the resulting interaction networks, we successfully predicted specific biological roles for a number of proteins with previously unknown functions, and identified some potential drug targets. Although our work is focused on E. coli, our phylogenetic projections suggest that a considerable fraction of our observations and predictions can be extrapolated to many other bacterial taxa. As all the data derived from this study are publicly available, others may build on our work for further hypothesis-driven studies of gene function discovery. 
Introduction Because of its central position in the microbial research community, the Gram-negative bacterium Escherichia coli plays a leading role in investigations of the fundamental molecular biology of bacteria [1–8]. This experimentally tractable microbe is a workhorse in basic and applied research aimed at elucidating the mechanistic basis of prokaryotic processes and traits, including those of pathogens. The ever-expanding availability of genomic resources makes E. coli particularly well-suited to systematic investigations of microbial protein components and functional relationships on a global scale. These include a genome-wide collection of single-gene deletion strains [2] along with extensive knowledge of regulatory circuits [3,5,7,9] and metabolic pathways [6,10,11]. Yet despite being the most highly studied model bacterium, a recent comprehensive community annotation effort for the fully sequenced reference K-12 laboratory strains [8] indicated that only half (~54%) of the protein-coding gene products of E. coli currently have experimental evidence indicative of a biological role. The remaining genes have either only generic, homology-derived functional attributes (e.g., “predicted DNA-binding”) or no discernable physiological significance. Some of these functional “orphans” (not to be confused with ORFans, which are genes present within only single or closely related species) may have eluded characterization in part because they exhibit mild mutant phenotypes, are expressed at low or undetectable levels, or have limited homology to annotated genes. This suggests more-sensitive analytical procedures are warranted. A key feature of the molecular organization of all organisms, including bacteria, is the tendency of gene products to associate into macromolecular complexes, biochemical pathways, and functional modules that in turn mediate all the major cellular processes. 
Elaboration of these interaction networks via proteomic, genomic, and bioinformatic approaches can reveal previously overlooked components and unanticipated functional associations [12]. For example, a recent integrative analysis of phenotypic, phylogenetic, and physical interaction data led to the discovery of an evolutionarily conserved set of novel bacterial motility-related proteins [13]. However, although systematic integration of diverse high-throughput interaction datasets is routinely performed to reveal new functional relationships in model eukaryotes such as yeast, worm, and fly [14–19], few analogous studies of the global functional architecture of E. coli, and any prokaryote for that matter, have been reported to date [20–22]. To this end, we have combined complementary, highly sensitive computational and experimental procedures to derive extensive high-quality maps of the functional interactions inferred by genomic context (GC) methods and physical interactions (PI) deduced by proteomics of E. coli. Our results indicate that many previously unannotated bacterial proteins are components of functionally cohesive modules and multiprotein complexes linked to well-known biological processes. A substantive fraction of these associations could be verified by independent experimentation and were found to be broadly conserved across prokaryotic phyla, indicating homologous systems in other microbes, whereas others are seemingly restricted to the E. coli lineage. The entire data collection is publicly accessible via a searchable Web-browser interface to stimulate exploration of both conserved and specialized bacterial proteins within the context of biological processes of particular interest. Results The Extent of Existing Functional Annotation for E. coli Proteins Since the functional characterization of E. 
coli, and bacteria in general, has largely been guided historically by scientific interests and technical considerations, some bias is expected in terms of the coverage and depth of existing biological knowledge as reflected in current gene annotations. To evaluate the degree to which the physiological functions of the 4,225 putative protein-coding sequences of E. coli K-12 are characterized presently, we examined the scope of literature reference records curated in the UniProt annotation system [23]. After excluding PubMed references corresponding to genomic mapping studies, the average total number of papers associated with each of the proteins of E. coli K-12 is surprisingly limited (Figure 1
We next examined recent E. coli K-12 (substrains W3110 and MG1655) gene annotations in the public databases RefSeq [24], MultiFun [25], and EcoCyc [11]. Since W3110 is commonly used for high-throughput studies, we devoted the bulk of our subsequent analysis to this substrain. However, to make sure that relevant functional attributes were not overlooked, we cross-mapped the corresponding gene accessions in both substrains and compiled an inclusive set of functional annotations accordingly (Table S1). In total, we found that 2,794 (66%) of E. coli's proteins had either proper mnemonic names [26], experimentally derived annotations in the MultiFun multifunction schema, or literature documentation to a well-defined pathway or multiprotein complex in EcoCyc (Figure 1 Properties of the Functional Orphans of E. coli The genes lacking annotation appear to be translated into bona fide proteins as their corresponding transcripts [28] were not significantly (p = 0.36) less stable than the products of annotated genes (Figure 1 Orphans also generally find fewer orthologs in a nonredundant genome dataset, filtered at 90% similarity based on the frequency of shared orthologs (Figure 1 A Systematic Approach to Elucidate Biological Function The scarcity of the existing knowledge regarding the biological roles of the orphans is likely due to multiple reasons, ranging from the lower expression, nonessentiality, or smaller sizes of certain orphan proteins to their lack of obvious homologs in other organisms including humans. Accordingly, integration of multiple data sources is warranted to decipher the specific biological roles of this uncharacterized repertory. Since the elucidation of physical and functional interaction networks can provide insights into bacterial protein function based on the concept of guilt by association [30], we took a multipronged approach. 
We performed large-scale proteomic analysis to determine orphan participation as components of multimeric protein complexes, and inferred functional relationships based on genomic context inference, which exploits the patterns of gene conservation across bacterial genomes [31]. We then predicted the functions of the orphans using an integrative machine-learning procedure with extensive benchmarking. Finally, we performed independent experiments to validate a subset of high-confidence predictions related to core biological processes. Key steps in our pipeline are outlined schematically in Figure 2
Experimental Definition of the Physical Interaction Network of the Soluble Proteome We performed systematic large-scale tandem-affinity purifications of all endogenous soluble orphan and annotated proteins detectably expressed in E. coli W3110 under standard culture conditions (see Materials and Methods and Protocol S3 for details). We used an optimized Sequential Peptide Affinity (SPA)-tagging system to isolate multiprotein complexes [32]. This procedure is based on the integration of a marker cassette bearing a dual-affinity tag, consisting of three FLAG sequences and a calmodulin-binding peptide separated by a protease cleavage site, fused to the C-termini of targeted open reading frames in E. coli DY330 (W3110 background) via λ-phage “Red”–mediated homologous recombination. This system enables recovery of native bacterial protein complexes at near-endogenous levels [4], minimizing spurious nonspecific protein associations. Stably interacting polypeptides were subsequently detected using a highly sensitive combination of tandem mass spectrometry (LCMS) and peptide mass fingerprinting procedures (MALDI) to increase detection coverage and accuracy (Protocol S3), just as we had previously done in a focused investigation of highly conserved essential E. coli proteins [4]. We successfully chromosomally tagged 1,241 new baits, aiming to verify putative interactions by reciprocal tagging where possible, for a total of 1,476 large-scale protein purifications (after including the 235 reported previously), of which 552 represented orphans (Protocol S3). Since proteomic datasets typically contain noise in the form of nonspecific associations, we performed a careful statistical analysis and quality filtering to determine biologically meaningful PI. 
We considered that the specificity and affinity between any two putatively interacting proteins should be correlated with the consistency of copurification over all the experiments in which the proteins were identified (i.e., co-complexed). We therefore used an established copurification metric [33] to assess interaction specificity based on the similarity of the protein copurification patterns (Protocol S3). We then generated a single consolidated confidence score for each putative pairwise physical interaction, based on the copurification metric together with the primary interaction evidence, to penalize inconsistent or promiscuous binders (i.e., possible false positives), using alternatively a logistic regression model or Bayesian inference [34] (Protocol S3). The logistic regression model was trained using a reference set of curated gold-standard PI (Protocol S3), which represents the union of experimentally verified PI derived from low-throughput experiments extracted from the Database of Interacting Proteins (DIP) [35], the Biomolecular Interaction Network Database (BIND) [36], and the IntAct database [37]. For the negative gold standards, we compiled pairs of proteins annotated with different subcellular localizations (i.e., one cytoplasmic, the other periplasmic or outer membrane-bound) [38]. Despite its relative simplicity, the logistic regression model offered better performance than the Bayesian method (see Figure 3
The resulting final network consisted of 5,993 high-confidence, nonredundant pairwise interactions among 1,757 distinct E. coli proteins, including 451 orphans, or roughly two-thirds of the predicted soluble cytoplasmic proteome. As summarized in Figure 3 The reliability of our dataset was also evident by two additional independent criteria. First, the mRNA expression patterns of the putatively interacting proteins were nearly as highly correlated as those of PI determined by low-throughput experiments (Figure 3 Collectively, these results indicate that our physical interaction network is very likely to be informative about orphan protein function. Orphan Membership within Multiprotein Complexes Since macromolecular assemblies mediate biological function in cells, we partitioned our high-confidence physical interaction network using the Markov clustering algorithm (MCL; see Materials and Methods and Protocol S4) to define orphan membership as subunits of discrete multiprotein complexes. MCL simulates random walks (i.e., flux) to delimit highly connected subnetworks based on both the connectivity and the weight of the graph edges [40]. In this case, the weights reflect the integrated PI scores obtained by logistic regression (Figure 2 We optimized the MCL parameters (see Materials and Methods and Protocol S4) to partition the 5,993 PI network, generating a set of 443 putative multiprotein complexes (Figure 3 For example, 25 orphans were detected as part of a large subnetwork of putative complexes involved in protein synthesis (Figure 3 Other orphans in this translation subnetwork include YibL, which copurified both with YfgB and YbcJ, and with RNA processing factors involved in ribosome biogenesis, such as the RNA pseudouridine synthetases RluB/RluC and the RNA helicase DeaD, and with RppH (formerly NudH), which was recently identified as a regulator of 5′-end–dependent mRNA degradation [44–46]. 
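The Markov clustering step described above, in which random walks on the weighted PI graph delimit highly connected subnetworks, can be illustrated with a minimal sketch. This is not the implementation or parameter set used in the study (the optimized parameters are given in Protocol S4 and are not reproduced here); the function name `markov_cluster`, the inflation value of 2.0, the self-loop weight of 1.0, the convergence tolerance, and the six-protein toy matrix are all assumptions chosen for illustration.

```python
import numpy as np

def markov_cluster(weights, inflation=2.0, max_iter=100, tol=1e-6):
    """Basic MCL loop on a symmetric matrix of edge weights (illustrative only)."""
    m = np.asarray(weights, dtype=float).copy()
    np.fill_diagonal(m, 1.0)                 # self-loops; the weight 1.0 is an assumption
    m /= m.sum(axis=0, keepdims=True)        # make each column a probability distribution
    for _ in range(max_iter):
        prev = m
        m = m @ m                            # expansion: two-step random walks
        m = m ** inflation                   # inflation: strengthen strong flow, weaken weak flow
        m /= m.sum(axis=0, keepdims=True)    # renormalize columns
        if np.abs(m - prev).max() < tol:     # stop once the flow matrix is stable
            break
    # Rows that retain flow are attractors; the columns they attract form a cluster.
    clusters = set()
    for row in m:
        members = tuple(int(i) for i in np.flatnonzero(row > 1e-6))
        if members:
            clusters.add(members)
    return sorted(clusters)

# Hypothetical toy network: two tightly connected triplets joined by one weak edge.
# Edge weights stand in for integrated PI confidence scores.
toy = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    toy[a, b] = toy[b, a] = 0.9
toy[2, 3] = toy[3, 2] = 0.05

clusters = markov_cluster(toy)
print(clusters)
```

In this sketch, raising the inflation value tends to produce smaller, tighter clusters, whereas lowering it tends to merge neighboring groups, which is why the parameter generally needs to be tuned against a reference set of known complexes.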
Similarly, the orphan YdhQ copurified with translation elongation factor Tu, whereas YagJ interacted with lysine tRNA synthetase (LysU); and YjcF, which has similarity to phenylalanyl-tRNA synthetase PheT of Bacteroides vulgatus, bound ribosomal release factor 2 and another orphan, YbeB, which in turn was found to associate with the 50S ribosome subunit, as recently reported [47]. These results confirm that our high-confidence physical interaction network is informative about the function of at least certain orphans. Functional Interactions Predicted by Genomic Context Methods Although we attempted to tag and purify the entire soluble E. coli interactome, we failed to detect 469 orphan proteins by MALDI or LCMS, presumably because they are membrane-associated (~35%; Figure S1B) and hence not soluble, or are of particularly low abundance (~40%), as reflected by their CAI and mRNA levels (Figures S1C and S1D, respectively). To bypass this limitation, we applied computational methods to discern a network of high-confidence pairwise functional interactions for all E. coli proteins, including those not detectable by proteomic methods, by examining the natural chromosomal clustering of bacterial genes. As illustrated in Figure 2 The pairwise interactions generated by each of these prediction methods were independently evaluated by benchmarking using suitable gold standards. Positive gold standards were defined as pairs of E. coli genes belonging to the same biological pathway as defined in EcoCyc, while the negative gold standards represented pairs of annotated E. coli genes whose products participate in different pathways (see Protocol S5 for details). The results of each GC method were subsequently combined to create a single unified functional association score (Protocol S6 and Figure 2
Despite the tendency of the orphans to exhibit more limited conservation notwithstanding the dependency of GC methods on homologs in multiple species (except for operon predictions based on intergenic distances [55]), our combined GC network implicated virtually all (1,367, or 96%) of the orphans in 23,365 pairwise functional interactions (Table S8). Moreover, relatively few (<18%) of our predicted interactions appear to have been reported previously (Figure 4 The reliability of our unified functional association network was independently corroborated based on the high correlations of expression among putatively interacting gene pairs (Figure 4 Defining the Participation of Orphans as the Components of Functional Modules Groups of functionally interacting genes form functional modules centered on a common process or biochemical pathway(s). To define orphan participation as components of such modules, we partitioned the high-confidence GC network using MCL (Protocol S4), generating a total of 507 putative functional modules consisting of two or more components (Figure 4 Two hundred and eighty-nine (57%) of the modules had at least one of a total of 1,189 different orphans. One notable example is shown in Figure 4 Several other prominent modules are shown in Figure 4 Other functional modules include frlA/frlB, part of the Frl operon of E. coli responsible for the import and metabolism of the alternative carbon source fructoselysine, together with the orphan YifK, which has sequence characteristics of a transporter [38], implicating it in electrochemical potential-driven uptake of this sugar. 
Conversely, two orphans, YecC and YecS, had functional associations consistent with linkages to amino acid and nucleotide metabolism, four (YagU, YqeG, YhaO, and YhaM) were linked to a putative module involved in transport and metabolism of threonine and serine, whereas three others (YjjI, YeiM, and YjjJ) were found in a module enriched for factors involved in nucleotide transport and degradation of deoxyribonucleosides. Taken as a whole, these results suggest discrete functional relationships for many previously unannotated proteins, implicating certain orphans within specific pathways. Improved Functional Inference within an Integrated Network Framework Examination of the extent of overlap between our physical and functional networks, both in terms of common binary interactions and shared components among the derived complexes (from PI) and modules (from GC), indicated that they are largely complementary (Table S11). Since a similar trend was also evident comparing other existing curated E. coli PI datasets (derived from either low-throughput or other high-throughput studies) with independent GC inferences (e.g., from STRING; Table S11), this presumably stems in part from the incomplete coverage obtained by these different approaches. Regardless, these observations imply that the union of PI and GC networks is necessary to capture the widest spectrum of biologically relevant interactions. Indeed, it has been shown previously that the combination of PI with functional genomic inferences, each statistically weighted according to dataset quality, can markedly improve both functional coverage and accuracy [59,61,70–72]. We therefore merged our experimental and predicted associations with the same method used to generate the unified GC network (Figure 2 The resulting combined probabilistic network consisted of 80,370 high-confidence (probability ≥75%) putative pairwise interactions encompassing virtually the entire proteome of E. 
coli, including 2,769 (99%) annotated proteins and 1,375 (96%) functional orphans (Table S12). Graph analysis of this final integrated network (Protocol S7 and Table S13) indicated that the orphans tended to have a lower overall connectivity and betweenness centrality, measured as the number of shortest paths going through a given node, relative to annotated components, suggesting more peripheral positions in the integrated networks. However, the orphans also exhibited lower average closeness, defined as the average length of shortest paths between any two nodes, and had similar overall clustering coefficients, indicating that, in general, the orphans are functionally connected to, rather than isolated from, the annotated gene products. These observations implied that consideration of both the individual associations and overall placement of the orphans within the integrated interaction network would facilitate functional deduction. We therefore devised a new network-based function prediction method (termed StepPLR; see Figure 2 As shown in Figure S4, StepPLR had better precision and recall prediction performance than several other widely used guilt-by-association procedures tested, such as majority-counting and chi-square–based methods (see Protocol S10 for details). Although the performance achieved for the different functional categories varied, our approach generated area-under-the-curve (AUC) values of 0.8 or higher for most of the COG (83%), GO (67%), and MultiFun (53%) categories (Table S15), and was relatively insensitive to the number of annotated proteins per function. Moreover, since our method exploited the correlation among the different categories, most orphans had multiple biologically consistent predicted functions (Table S16). Functional Neighborhoods As displayed graphically in Figure 5
We also examined an alternate group of orphans (YafP, YiaD, and YbcM) associated with the flagellar biogenesis and motility apparatus (Figure 5 Many other orphans were predicted to have roles in other conserved biological systems, such as DNA replication. For example, as shown in Figure 5 Novel Components of Bacterial Cell Envelope Biogenesis Pathways Our functional predictions were particularly revealing about bacterial cell envelope biology, with implications for infectious disease and antibiotic susceptibility. Like other free-living microbes, E. coli is encased in a membranous cell envelope composed of proteins, lipids, and carbohydrates that serves as the interface to its environment and mammalian host, yet over a third of the approximately 1,000 predicted membrane-associated and periplasmic proteins of E. coli are presently functionally unannotated [38]. Figure 6
Figure 6 Other orphans in this neighborhood (e.g., YiiD, YbjT, and YbcH) are predicted to interact with the Rff or Rfb pathways involved in the generation of other important cell envelope constituents, such as enterobacterial common antigen and O-antigen. Likewise, YibD and YafL interact with Wca and Wz proteins that participate in biosynthesis of colanic acid (M-antigen). Moreover, although they are not formally classified within this route, we detected functional interactions of RfaD with LpxD, KdsA, and KdsC involved in the synthesis of ADP-L-glycero-D-manno-heptose, which is used by many enzymes for LPS production. As with mutants deficient in enzymes annotated to participate in cell envelope biogenesis, we found that deletions of the orphans in this neighborhood significantly (p < 0.05; see Protocol S11) perturbed cell viability upon exposure to antibiotics that block formation of the bacterial cell envelope (Figure 6 Functional Interactions of Orphans Extend beyond Proteobacteria To investigate the evolutionary significance of the putative functional associations detected in E. coli, we examined the presence of orthologs of each of the interacting orphan and annotated protein pairs among currently available prokaryotic genomes (see Materials and Methods and Protocol S12 for details). As might be expected from the more limited evolutionary scope of the orphans (cf. Figure 1
We next compared the phylogenetic distributions of the orphan and annotated components of the 97 predicted functional neighborhoods with at least one orphan (Figure 7 Conversely, annotated components exhibited a broader phylogenetic distribution in the remaining 36 neighborhoods. For example, two orphans (YneG and YdaU) linked to DNA replication (Figure 7 Discussion Defining the precise biological roles and relationships of bacterial gene products in an often dynamically changing physiological context is a challenging proposition. Historically, systematic assessments of protein function in bacteria have tended to rely on molecular inferences based on sequence alignments and domain architectures, whereas experimental characterization has traditionally been driven by specific scientific interests rather than with the aim of providing the broader community with unbiased collections of functionally related proteins and phenotypes. Since the biological role of a protein is not necessarily reflected in its primary sequence, the elucidation of molecular interaction networks can provide an alternate perspective even in the absence of detailed phenotypic data [16,71]. Here, we have opted to view a model microbial cell mechanistically as a series of modular molecular interaction networks that underlie the major biochemical processes that mediate cell homeostasis and proliferation, wherein the functional attributes of particular gene products are reflected in their overall patterns of associations. To this end, we have generated an extensive compendium of physical and functional linkages covering almost the entire protein-coding complement of E. coli. This led to the elucidation of hundreds of putative soluble multiprotein complexes and functional modules encompassing virtually all the many gene products currently lacking public annotations. 
Although existing integrative probabilistic interaction databases such as STRING [61] and EcID [82] provide valuable additional binary interactions that are potentially useful for protein function prediction or as complementary evidence to those reported in this study, our machine learning strategy goes beyond describing binary interactions by explicitly describing the most probable biological functions of the orphans. Of particular noteworthiness, our functional predictions and phylogenetic projections associate a sizeable fraction of the functional orphans with core bacterial processes, suggesting they may have previously eluded detection in part due to prior analytical biases. Since the various methods used in this study to discover different types of molecular relationships also have their own intrinsic biases, complementary information was obtained through data integration. The limited overlap between the high-confidence physical and functional interaction networks presumably stems in part from the incomplete coverage typically achieved by high-throughput experiments and their methodological differences [13,83]. For example, certain orphans were difficult to evaluate by GC methods due to a lack of apparent orthologs at medium-to-high evolutionary distances, which hinders comparative genomic inferences. Likewise, although we performed large-scale tandem affinity tagging and purification under near-native physiological conditions to generate highly purified preparations of stable, endogenous multiprotein complexes, we did not achieve complete coverage of the proteome. We did not attempt to purify a large number of membrane-associated proteins, which require specialized solubilization procedures, whereas the soluble proteins that we failed to tag or detect by mass spectrometry were presumably either of very low abundance or not expressed in our growth conditions. 
Comparison of our physical interaction network with analogous public datasets produced for other model species, such as worm, fly, yeast, and even the bacterium Helicobacter pylori, revealed very limited (<1%) overlap. These observations are congruent with recent findings by Rajagopala and colleagues [13] showing that only a third (49) of the 173 experimentally derived PI in the cell motility network of the spirochete Treponema pallidum predicted to occur in the ε-proteobacterium Campylobacter jejuni on the basis of orthology could subsequently be confirmed by targeted two-hybrid testing. The limited overlap between proteomic datasets presumably reflects a combination of incomplete coverage by various experimental assays, methodological differences, and evolutionary divergence. The observation that the intersection of functional genomics inferences with low-throughput curated physical interaction data is somewhat higher (cf. Table S11) might be explained in two non-mutually exclusive ways: first, protein–protein interactions reported in the literature based on traditional biochemical methods might be biased towards the most evolutionarily conserved multiprotein complexes, which tend to be enriched for essential components with broadly distributed phylogenetic profiles that are more easily and accurately predicted by GC methods; second, the relatively high sensitivity of the two complementary forms of protein mass spectrometry used in this study may have resulted in the detection of lower-abundance orphan proteins that have previously not been studied in depth. The last point is consistent with the notion that different proteomic methods capture different PI types [83]. Hence, alternative proteomic methods, such as two-hybrid screens [13,84–86] or in vivo protein-fragment complementation assays [87], may be better suited for detecting certain PI currently underrepresented in our dataset. 
In a similar vein, additional functional relationships will undoubtedly be uncovered by different experimental and computational procedures, such as high-throughput comparative analysis of mutant cellular phenotypes [2], genome-wide genetic interaction screens [88,89], and automated text mining [90,91]. The topological properties inherent to biological networks (e.g., their hierarchical organization and degree distributions) combined with incomplete interactome coverage make establishing definitive functional groupings difficult [92]. Our approach was to take into account both the correlations among functional categories and the overall topological structure of the integrated network to generate a more balanced probabilistic model. Whereas alternative methods may provide enhanced interpretations of the organizational properties of the PI and GC networks, the functional enrichment and experimental validations established here suggest that our network-based computational inferences provide a reasonable perspective for exploring bacterial protein function. Similar strategies have resulted in powerful predictors of protein function in eukaryotes [49,72,93–95]. The potential tradeoff is that additional error or uncertainty may have occasionally been introduced by assuming functional similarity among more loosely connected proteins. Moreover, the probabilities associated with particular functional terms may not be directly comparable. Functional orphans associated with very well-characterized biological processes are more likely to be correctly assigned by computational methods [72], whereas those associated with relatively poorly studied pathways will tend to remain obscure. Nonetheless, they can be grouped together on the basis of specific PI, GC, or even other functional associations (cf. Figure 7).
In general, the high-confidence functional relationships we inferred for E. coli could be validated by independent experimental tests, and can be extrapolated to other bacterial species, including pathogens. Over 35% of the orphans find orthologs as far away as Archaea, and hence are likely associated with the same basic housekeeping processes we predict for E. coli, such as formation of the cell wall and protein synthesis. For instance, we have established putative roles in sugar and lipid metabolic pathways for several dozen evolutionarily conserved orphans that appear to be critical for proper biogenesis of the bacterial cell envelope, and hence may represent novel targets for antibiotic development. Conversely, our systematic comparisons also revealed some unique aspects of the orphans in the evolutionary history of E. coli, such as the potential fimbriae factors that appear to be restricted to Enterobacteriaceae. One interpretation is that orphans with limited phylogenetic distributions contribute to fine tuning of adaptive physiological responses upon changing environmental conditions, as previously suggested for peripheral metabolic genes acquired by horizontal transfer [96]. At the same time, the fact that the interactions of orphans with annotated proteins show a higher proportion of conservation across taxa implies that conserved biological systems remain to be discovered whose member contributions could extend across evolutionary domains. The physical and functional associations reported here are therefore presented as a Web-accessible public resource called “eNet” (http://ecoli.med.utoronto.ca; see Protocol S13 for details) to facilitate exploration of the fundamental molecular biology of bacteria in general and for hypothesis-driven studies of unique aspects pertaining to E. coli more specifically.
Materials and Methods
PI network generation. Large-scale SPA tagging and purifications were performed essentially as previously described [4,32].
Briefly, a DNA cassette encoding the SPA-tag and a selectable marker flanked by gene-specific targeting sequences was amplified by PCR using primers with homology to a selected locus. The cassette was then transformed and integrated by homologous recombination into the lysogenic E. coli strain DY330 (W3110 background), which harbors the highly efficient λ-phage–encoded homologous recombination enzymes exo, bet, and gam under the control of the temperature-sensitive CI857 repressor (the “Red” system), to create a C-terminal fusion with the protein of interest. Strains in which the PCR product was integrated were subjected to antibiotic selection, and tagged protein expression was confirmed by western blotting. Tagging primer sequences are available upon request. Two complementary mass spectrometry techniques (gel-based MALDI peptide mass fingerprinting and gel-free LCMS shotgun sequencing) were used to detect physically interacting proteins. Details about the large-scale strain culture, protein extraction and purification, and protein identification procedures are provided in Protocol S3. Tentative PI from the LCMS and MALDI assays were scored with a logistic regression model, using reference PI obtained by low-throughput experiments curated in the DIP, BIND, and IntAct databases [35–37] as a positive training set. Our negative training set consisted of pairs of proteins in which one component was experimentally determined or predicted with high confidence to be cytoplasmic and the other to reside in the outer membrane or the periplasm [38]; inner membrane proteins were discarded from this negative dataset since they are in physical proximity to (and hence could potentially physically interact with) cytoplasmic and periplasmic proteins.
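The construction of the negative training set described above can be illustrated with a minimal sketch. The protein names and the compartment labels below are hypothetical placeholders, not data from this study; the only logic taken from the text is that cytoplasmic proteins are paired with outer membrane or periplasmic proteins and that inner membrane proteins are left out.

```python
from itertools import product

# Hypothetical localization table; names and labels are illustrative only.
localization = {
    "cytA": "cytoplasm",
    "cytB": "cytoplasm",
    "omC": "outer_membrane",
    "periD": "periplasm",
    "imE": "inner_membrane",
}

def negative_pairs(localization):
    """Pair every cytoplasmic protein with every outer membrane or periplasmic protein."""
    cytoplasmic = sorted(p for p, loc in localization.items() if loc == "cytoplasm")
    distal = sorted(
        p for p, loc in localization.items() if loc in ("outer_membrane", "periplasm")
    )
    # Inner membrane proteins appear in neither list, so they never enter a pair.
    return [tuple(sorted(pair)) for pair in product(cytoplasmic, distal)]

pairs = negative_pairs(localization)
for pair in pairs:
    print(pair)
```

Pairs built this way are presumed non-interacting only because the two proteins occupy separate compartments, which is why proteins from the intervening inner membrane are excluded.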
Our logistic regression procedure also took into account the degree of consistency of copurifying protein pairs, balancing the tradeoff between “spoke” and “matrix” representation models of interactions within copurified groups of proteins to decrease the false discovery rate. We then combined the scores derived from LCMS and MALDI into a single PI network using a previously established procedure for integrating probabilistic networks [61], which assumes the reliabilities of associations generated by these methods are independent (see Protocol S6 for details). To facilitate independent critical evaluation, all our processed interaction data are available through the Web site in HUPO-PSI molecular interaction reporting format (standard level 2.5).
GC network generation. The four GC methods used to predict functional interactions among E. coli proteins were based on: (1) functional linkages among genes that fuse to form a single open reading frame in at least one other genome, i.e., gene fusion [48]; (2) the mutual information of the coordinated presence or absence of pairs of genes across a set of 440 nonredundant genomes, i.e., phylogenetic profiles [51,97]; and the natural chromosomal association of bacterial genes in operons as detected by two alternative methods, namely (3) the tendency of genes forming operons to show small intergenic distances [98,99], and (4) the conservation of gene order, in which a confidence value for each pair of adjacent genes on the same strand was used as an indicator that those genes likely form an operon, as compared with the conservation of adjacent genes on opposite strands [53]. For the last two methods, subsequent operon rearrangements were detected by genomic mapping of orthologs across 440 nonredundant bacterial genomes [55]. For all four GC methods, we used the BLAST-BDBHs as an operational definition of orthology (see Protocol S5 for details).
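The phylogenetic profile method (2) scores a gene pair by the mutual information of their presence/absence patterns across genomes. The sketch below shows the standard mutual information calculation for two binary profiles; the toy profiles are invented for illustration, and any normalization or thresholding applied in Protocol S5 is not reproduced here.

```python
from collections import Counter
from math import log2

def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two equal-length presence/absence profiles."""
    n = len(profile_a)
    if n == 0 or n != len(profile_b):
        raise ValueError("profiles must be non-empty and of equal length")
    joint = Counter(zip(profile_a, profile_b))
    count_a = Counter(profile_a)
    count_b = Counter(profile_b)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # Only observed combinations are visited, so p_ab is never zero here.
        mi += p_ab * log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

# Toy profiles over four hypothetical genomes (1 = ortholog present, 0 = absent).
gene_x = [1, 1, 0, 0]
gene_y = [1, 1, 0, 0]  # always co-occurs with gene_x
gene_z = [1, 0, 1, 0]  # occurs independently of gene_x

mi_xy = mutual_information(gene_x, gene_y)
mi_xz = mutual_information(gene_x, gene_z)
print(mi_xy, mi_xz)
```

Genes that are gained and lost together yield high mutual information, whereas genes distributed independently of each other score near zero.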
To avoid circularity, the prediction scores of the four GC methods were benchmarked separately using proteins belonging to the same metabolic pathway according to EcoCyc [11] as the positive reference set, and proteins in different pathways as negatives (Protocol S5). A single, unified high-confidence functional association network was then constructed by integrating the interaction predictions generated by the four GC methods using the same scoring model [61] used to integrate the MALDI and LCMS data (Protocol S6).
Clustering of networks. Protein clusters were generated from three different networks using MCL [40] (Figure 2).
Network-based function prediction and benchmarking. Our algorithm (StepPLR) for assigning biological functions is essentially a network topology–based method in which the functions of the orphans are predicted based on the functions of their associated annotated proteins in the immediate (direct) and adjacent (indirect) network vicinity (see Protocol S9 for details). Briefly, a single network integrating the high-confidence PI and GC probabilistic networks was first created using the same scoring model [61] used to integrate the PI data and the four GC networks. The weighted topological overlap [101] between each pair of protein nodes in the integrated network was then calculated to determine the correlated functional profiles based on a penalized logistic regression model (see Protocol S8 for details). Finally, a stepwise variable selection procedure was used to optimize function profiles in the final logistic regression (see Protocol S9 for details). Only functional categories with at least 15 annotated E. coli proteins were used in our integrated functional association network (see Table S14): 18 COG classes, corresponding to bacterial protein functions; 19 biological classes from MultiFun, in which the proteins can have multiple annotations based on different classification criteria; and 51 biological process classes in GO.
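The integration and topological overlap steps described above can be sketched as follows. Two assumptions are made here and should be checked against Protocols S6 and S8: first, that scores are combined as 1 − Π(1 − p), a common rule when evidence sources are treated as independent, which may differ in detail from the scoring model of [61]; second, that the weighted topological overlap takes the widely used form (Σ_u a_iu·a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij), where k is the summed edge weight of a node. The edge scores are invented, and the penalized logistic regression that follows in StepPLR is not shown.

```python
from collections import defaultdict

def integrate(*networks):
    """Combine edge probabilities as 1 - prod(1 - p), assuming independent sources."""
    miss = defaultdict(lambda: 1.0)  # probability that no source supports the edge
    for net in networks:
        for (u, v), p in net.items():
            miss[frozenset((u, v))] *= 1.0 - p
    return {edge: 1.0 - q for edge, q in miss.items()}

def topological_overlap(weights):
    """Weighted topological overlap for every node pair of an undirected network."""
    nodes = sorted({n for edge in weights for n in edge})

    def a(i, j):
        return weights.get(frozenset((i, j)), 0.0)

    # Weighted connectivity of each node (sum of its edge weights).
    k = {i: sum(a(i, u) for u in nodes if u != i) for i in nodes}
    overlap = {}
    for idx, i in enumerate(nodes):
        for j in nodes[idx + 1:]:
            shared = sum(a(i, u) * a(u, j) for u in nodes if u not in (i, j))
            overlap[(i, j)] = (shared + a(i, j)) / (min(k[i], k[j]) + 1.0 - a(i, j))
    return overlap

# Hypothetical edge scores for three proteins A, B, and C.
pi_net = {("A", "B"): 0.5}
gc_net = {("A", "B"): 0.5, ("B", "C"): 0.8, ("A", "C"): 0.8}

integrated = integrate(pi_net, gc_net)
overlap = topological_overlap(integrated)
print(integrated[frozenset(("A", "B"))], overlap[("A", "B")])
```

Under these assumptions, two proteins obtain a high overlap score when they are directly linked and also share strongly weighted neighbors, which is how indirect network vicinity can contribute to a prediction.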
Other representative guilt-by-association methods (e.g., majority-counting and chi-square–based) were also evaluated (results shown in Figure S4A). Expanded descriptions of benchmarking and other computational procedures of our function prediction algorithm are provided in Protocols S9 and S10.
Experimental validation of functional predictions. Orphans were selected for experimental validation of functional predictions based on the following criteria: (1) the orphan was predicted to perform a function for which a suitable phenotypic assay was previously reported (e.g., an antibiotic targeting the associated function was available); (2) the orphan was clearly grouped with select annotated genes, allowing the inclusion of positive as well as negative controls; and (3) the orphan had high (>0.8 confidence) function prediction score(s). Antibiotic susceptibility assays were performed by pinning orphan and annotated gene knockout mutants [2] onto solid media plates in the presence or absence of antibiotics, and then imaging and comparing colony sizes. Details of the antibiotic sensitivity, translation, and auxotrophy assays are provided in Protocol S11. Motility assays were performed with overnight E. coli strain cultures pinned onto rectangular Petri dishes (Singer) containing semisolid swarming agar (LB medium with 0.25% agar). The swarming phenotype was classified visually based on the cell spreading-halo diameter observed after approximately 8 h incubation at 32 °C. Biofilm formation assays were conducted essentially as described in [102], with replicate data normalized relative to wild-type controls (Protocol S11). Epistatic genetic interactions between pairs of gene mutants in E. coli were identified using a newly developed conjugation-based screening method [88].
Briefly, a drug resistance–marked query gene deletion in a high-frequency recombination donor strain was crossed into either single-gene deletion knockout mutants from the Keio strain collection [2] or select essential gene hypomorphs to generate double mutants. After double drug selection, synthetic lethal or sick phenotypes were scored visually according to measured colony sizes (Protocol S11).
Figure S1: Influence of Gene Expression and Subcellular Localization on PI Detection (37 KB PDF)
Figure S2: Functional Homogeneity, Connectivity, and Cluster Size in PI Complexes and GC Modules (38 KB PDF)
Figure S3: Auxotrophy of ydiBΔ and ydiNΔ for Shikimic- and Aromatic Amino Acids (620 KB PDF)
Figure S4: Precision and Recall Benchmark Analysis of Function Prediction Algorithms (29 KB PDF)
Figure S5: Clustered Annotation Terms and Functional Neighborhoods (348 KB PDF)
Protocol S4: Clustering to Define Protein Complexes, Functional Modules, and Neighborhoods (34 KB DOC)
Protocol S5: Prediction of Functional Interactions by Genomic Context (72 KB DOC)
Protocol S6: Global Integration of Different Data Sources for Function Prediction (30 KB DOC)
Protocol S7: Analysis of Topological Network Properties (28 KB DOC)
Protocol S8: Calculating Node Similarity in the Integrated Functional Association Network (34 KB DOC)
Protocol S9: Network-Based Protein Function Prediction (154 KB DOC)
Protocol S10: Comparison of Our New Algorithm (StepPLR) with Established Methods (33 KB DOC)
Protocol S11: Experimental Validations of Functional Predictions (47 KB DOC)
Protocol S12: Analysis of Functional Interactions of Orphans Extend beyond Proteobacteria (22 KB DOC)
Table S1: E. coli K-12 (W3110) Gene Annotations and Properties (1.73 MB XLS)
Table S2: Genomic and Metagenomic Conservation of E. coli K-12 (W3110) Genes (2.17 MB XLS)
Table S3: Performance Comparison of Different Methods in PI Analysis (8 KB PDF)
Table S5: Promiscuous “Hub” Proteins Filtered from the PI Dataset (7 KB PDF)
Table S8: Integrated Functional Interaction Data Generated by Genomic Context (4.08 MB TXT)
Table S10: Association of the Orphan Genes in Fimbrial Module to Biofilm Gene Expression Studies (30 KB XLS)
Table S11: Comparison of PI versus GC Interactions (8 KB PDF)
Table S13: Topological PI, GC, and Integrated Network Properties (24 KB XLS)
Table S14: Gold Standards of Functional Categories Used for Function Prediction (1.15 MB XLS)
Table S15: Function Prediction Performance Measured by AUC Scores (35 KB XLS)
Table S18: Relative Fitness and Functional Enrichment Analyses of Orphan and Annotated Genes Using Drug Screens (151 KB XLS)
Acknowledgments
We thank John Parkinson, Quaid Morris, and Radhakrishnan Mahadevan for helpful suggestions; and Ziqi Zhang and other members of the Emili, Greenblatt, and Moreno-Hagelsieb laboratories for technical assistance and critical input. GM-H acknowledges having used computer resources from the SHARCNET (Shared Hierarchical Academic Research Computing Network) consortium.
Footnotes
¤ Current address: Life Science Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
Author contributions. GM-H and AE conceived and designed the experiments. PH, SCJ, MB, JJDM, GB, WY, OP, XG, PW, SC, CC, AN-A, NKK, GM, MA, NN, VE, AG, AP, and GM-H performed the experiments. PH, SCJ, MB, JJDM, GB, SP, and GM-H analyzed the data. JFG provided critical suggestions and data interpretations. PH, SCJ, MB, JJDM, GB, JFG, GM-H, and AE wrote the paper.
Funding. This work was supported with grants from the Canadian Institutes of Health Research (MOP-77639), the Ontario Research and Development Challenge Fund, Genome Canada, and the Ontario Genomics Institute to JFG and AE, and from the Natural Sciences and Engineering Research Council of Canada to AG and GM-H.
Competing interests. The authors have declared that no competing interests exist.