NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.

Sequence - Evolution - Function: Computational Approaches in Comparative Genomics.


The use of genome sequences to solve biological problems has been afforded its own label; for better or worse, it's called "functional genomics."

David J. Galas. Making Sense of the Sequence. Science, 2001, vol. 291, p. 1257

When the completion of the draft of the human genome sequence was announced on June 26, 2000, all the parties involved agreed that the major task of identifying the functions of all human genes was still many years ahead. In fact, even the much simpler task of mapping all the genes in the final version of the human genome sequence that should become available within the next few years remains a major problem. Identification of all protein-coding genes in the genome sequence and determination of the cellular functions of the proteins encoded in these genes can be accomplished only by combining powerful computational tools with a variety of experimental approaches from the arsenals of biochemistry, molecular biology, genetics and cell biology. Linking sequence to function and both to the evolutionary history of life is the fundamental task of new biology.

This book is devoted to the principles, methods and some achievements of computational comparative genomics, which has shaped up as a separate discipline only in the last 5-7 years. Its beginnings were modest, with only the genome sequences of viruses and organelles determined in the 1980s. These sequences were important for their respective disciplines and as a test ground for computational methods of genome analysis, but they were not particularly helpful for understanding how an autonomous cell works. By 1992, the first chromosomes of baker’s yeast and large chunks of bacterial genomes started to emerge, and researchers began pondering the question: What’s in the genome? The breakthrough came in 1995 with the complete sequencing of the first genome of a cellular life form, the bacterium Haemophilus influenzae. The second bacterial genome, Mycoplasma genitalium, followed within months. The next year, the first complete genomes of an archaeon (Methanococcus jannaschii) and a eukaryote (yeast Saccharomyces cerevisiae) became available. Many more microbial genomes followed, and in 1999, the first genome of a multicellular eukaryote, the nematode Caenorhabditis elegans, was sequenced. The year 2000 brought us the complete genomes of the fruit fly Drosophila melanogaster and the thale cress Arabidopsis thaliana, and two independent drafts of the human genome followed suit in 2001. Thus, we entered the 21st century already having at hand this 3.2 billion-letter text that has been referred to as the Book of Life, as well as a number of accompanying books on other life forms. The challenge is now to read and interpret them.

To extract biological information from enormous strings of As, Cs, Ts, and Gs, functional genomics depends on computational analysis of the sequence data. It is unrealistic to expect that every single gene or even a majority of the genes found in the sequenced genomes would ever be studied experimentally. However, using the relatively cheap and fast computational approaches, it is usually possible to predict the protein-coding regions in the DNA sequence with reasonable (albeit varying) confidence and to get at least some insight into the possible functions of the encoded proteins. Such an analysis proves valuable for many branches of biology, in large part because it assists in classification and prioritization of the targets for future experimental research.

Computations on genomes are inexpensive and fast compared to large-scale experimentation, but it would be a mistake to equate this with ‘easy’. The history of annotation and comparative analysis of the first sequenced genomes convincingly (and sometimes painfully) shows that the quality and utility of the final product critically depend on the employed methods and the depth of interpretation of the results obtained by computer methods. Unfortunately, errors produced in the course of computer analysis are propagated just as easily as real discoveries, which makes development of reliable protocols and crystallization of the accumulating experience of genome analysis in easily accessible forms particularly important.

While functional annotation of genomes may be the most obvious, and in a sense, the most important purpose of computational genomics, it is not just a supporting service for experimental functional genomics, but a discipline in itself, with its own fundamental goals. The main such goal is understanding genome evolution. Ultimately, understanding here means being able to reconstruct the most likely sequence of evolutionary events that produced these genomes. Attaining this goal will require many more genomes, development of new algorithms, and years of careful analysis. Nevertheless, even in its infancy, comparative genomics has brought genuine revelations about evolution. We believe that the principal news that could not be easily foreseen in the pre-genomic era is the extreme diversity of the gene composition in different evolutionary lineages. This strongly suggests that, at least among prokaryotes, horizontal gene transfer and lineage-specific gene loss were major, formative evolutionary forces, rather than rare and relatively inconsequential events as assumed previously. Accordingly, the straightforward image of evolution as the growth of the tree of life is replaced by one of a ‘grove’, in which vertical, tree-type growth does occur, but multiple horizontal connections are equally prominent: an incomparably more complex, but also more interesting, picture of life than ever suspected before.

This book describes the computational approaches that proved to be useful in analyzing complete genomes. It is intended for a broad range of biologists, including experimental biologists and graduate and advanced undergraduate students, whose work builds upon the results of genome analysis and comprises the foundation of functional genomics. However, we attempted to make the text interesting also for practitioners of genomics itself, particularly those computational biologists whose main occupation is developing algorithms and programs for genome analysis and who could benefit from an accessible discussion of some biological implications of these methods. Most of the approaches discussed in this book have been developed during comparative analysis of the first set of completely sequenced bacterial and archaeal genomes, which are simpler and more amenable to straightforward computational dissection than the much larger eukaryotic genomes. We show, however, that the main principles remain the same for comparative genomics in general.

The book starts with a brief overview of the history of genomics. We list the completed and ongoing genome sequencing projects and show how little is actually known, even about simple genomes. We then discuss the conceptual basis of comparative genomics, emphasizing the evolutionary principles of protein function assignments. The book then proceeds to discuss the databases that store and organize genomic data, with their unique advantages and pitfalls. Familiarity with these databases is useful for any biologist, but for those interested in functional or evolutionary genomics, it is essential.

The central part of the book discusses, in some depth, the principles and methods of genome analysis and annotation, including identification of genes in genomic DNA sequence and using sequence comparisons for functional annotation of predicted proteins. We introduce the most common sequence similarity search methods and discuss the ways to automate the searches and increase search sensitivity, while minimizing the error rate. The common sources of errors in functional annotation of genomes are discussed, and some simple rules of thumb are provided that may help avoid them. We further focus on the approaches to functional prediction that rely on the genome context, such as examination of phyletic patterns, gene (domain) fusions, and conserved gene strings (operons). The discussion is illustrated by examples from comparative genomics of prokaryotes.

The remaining parts of the book consider fundamental and practical applications of comparative genomics. In particular, in Chapter 6, we discuss the impact of comparative genomics on our current understanding of several fundamental problems of evolutionary biology and some major events of life’s history.

The book is non-technical with respect to the computer methods for genome analysis; we discuss these methods from the user’s viewpoint, without addressing mathematical and algorithmic details. Prior practical familiarity with the basic methods for sequence analysis is a major advantage, but a reader without such experience should be able to use the book as an introduction to these methods. Knowledge of molecular biology and genetics at the level of basic undergraduate courses is required for understanding the material; similar knowledge of microbiology is a plus. The book is accompanied by a problem set, designed to be solved by using tools available through the web. Hopefully, this will allow the reader to develop a better feeling for the practical use of the methods discussed in the text. Chapters 1 through 5 are, definitely, at the introductory level, although we attempted to include some non-trivial examples and discussion of open issues. There is considerable cross-talk between Chapters 3 and 4, which might be perceived as a degree of redundancy. We felt, however, that it was appropriate to discuss some key notions in protein analysis twice, first from a purely practical and then from a more fundamental standpoint. Chapters 6, 7, and 8 are somewhat more involved and, we hope, might be of certain interest even to experts. However, we tried to ensure that a non-expert reader would be in a position to understand the material of these chapters after reading the book from the beginning.

Probably the main purpose of any Preface is a disclaimer and apologies. So what is not in this book? First of all, we could not even think of covering the entire field of comparative genomics: this field is young but has already branched widely, and we cannot claim even to know of all important research directions, let alone to be experts in them. We cite many publications, but, again, we could not even think of citing all the relevant ones: this would take the entire space of the book and the task still would not have been accomplished. We sincerely apologize to all those colleagues whose important work is not cited because of space considerations or, unfortunately, because of our ignorance and negligence. Most of the case studies discussed in this book are drawn from our own work. This is certainly not to imply that we believe it to be in any sense superior to the work of others, but simply because this is what we know best. However, unfortunately, there may be cases where, for the above reason, we cite and discuss our own work instead of more decisive and interesting work of other researchers, and to them we offer our heartfelt apologies.

The parts of this book that deal with sequence and structure analysis algorithms might irk some of our colleagues involved in the development of these methods by superficiality and lack of rigor. We owe a great debt to these researchers and extend our regrets and apologies. A more technical point: most of the research discussed in this book is done with protein sequences and structures. Partly, this is because we believe that the main knowledge so far accumulated by comparative genomics has been attained through this type of analysis. The other reason, however, is that this is where our main experience is, and we apologize to the readers for not covering numerous important studies on non-coding regions of the genomes. Finally, a terminological point related to the last issue: throughout the book, we rather freely substitute proteins for the genes that encode them by talking about duplications, mutations and other evolutionary events in proteins. This is just for the sake of brevity; we assure the reader that we are aware of the fact that proteins actually do not undergo any of these events, only the respective genes do.

Despite all these shortcomings and, undoubtedly, others that we are unaware of, we hope that this book will help the reader to understand the principles and approaches of comparative genomics and the potential and limitations of computational and experimental approaches to genome analysis. This should go some distance toward building a bridge across the "digital divide" between biologists and computer scientists, hopefully allowing biologists of various directions and persuasions to better grasp the peculiarities of the emerging field of Genome Biology and to learn how to benefit from the enormous amount of sequence and structural data available in the public databases.

This book has become possible thanks to our close collaboration with numerous colleagues from the NCBI and other institutions. It is, unfortunately, impossible to mention everyone, but we must gratefully acknowledge many hours of illuminating discussions over the years of interactions with L. Aravind, Peer Bork, Valerian Dolja, Mikhail Gelfand, Alexander Gorbalenya, Alexey Kondrashov, David Lipman, Arcady Mushegian, Pavel Pevzner, Igor Rogozin, and Yuri Wolf. We greatly appreciate all the work that Roman Tatusov and Darren Natale put in the COG database, which permeates this book. We thank the following colleagues for critical reading of individual chapters and helpful criticisms: Chapter 3, Peter Cooper, Aviva Jacobs, David Wheeler, and Jodie Yin; Chapters 4, 5, 6, and 8, Igor Rogozin; Chapter 6, Fyodor Kondrashov; and Chapter 8, Yuri Wolf. Yuri Wolf kindly provided Figures 8.3, 8.4, and 8.5, and the entire sections 8.2 and 8.3 are largely the result of collaboration and intense discussions with Yuri Wolf and Georgy Karev. We thank L. Aravind, Trevor Fennon, Kira Makarova, Boris Mirkin, and Yuri Wolf for the kind permission to cite some of our unpublished joint work. Several figures in this book come from the NCBI Entrez Genomes web site. We appreciate the work of the team that supports this site. We are grateful to our editor Joanne Tracy for her constant prodding and encouragement, not to mention editorial support, without which this book would have never come to life. Last but not least, we thank our families for their enormous patience and understanding.

The opinions expressed in this book reflect personal views of the authors and have no relation to the official positions (if any) on the issues involved held by the National Library of Medicine, National Institutes of Health, or the US Department of Health and Human Services.

Eugene Koonin

Michael Galperin

Bethesda, August 2002

Copyright © 2003, Kluwer Academic.
Bookshelf ID: NBK20259

