RESEARCH INTERESTS

My broad and long-term research interest is to further the understanding of the structure, function, and evolution of genomes. I am particularly interested in how transposable elements impact each of these aspects.

I entered the field of genomics as a Ph.D. student in 2001, shortly after the first drafts of the human genome were published. Since then, I have been using computational methods to study various aspects of genomic evolution, such as the ones highlighted in the abstracts below. As a research fellow in Ivan Ovcharenko's group, I work on advancing the knowledge of gene regulation at the genomic level. This includes predicting novel distant regulatory elements, which can then be tested experimentally by our collaborators.


Distant Regulatory Elements of co-regulated genes (DiRE)

Regulation of gene expression in eukaryotic genomes is established through a complex cooperative activity of proximal promoters and distant regulatory elements (REs) such as enhancers, repressors, and silencers. DiRE is a web server that implements the previously described Enhancer Identification (EI) method to determine the chromosomal location and functional characteristics of distant REs in higher eukaryotic genomes. The server uses gene co-expression data, comparative genomics, and combinatorics of transcription factor binding sites (TFBSs) to find TFBS-association signatures that can be used for discriminating specific regulatory functions. DiRE's unique feature is the detection of REs outside of proximal promoter regions, as it takes advantage of the full gene locus to conduct the search. DiRE can predict common REs for any set of input genes for which the user has prior knowledge of co-expression, co-function, or other biologically meaningful grouping. The server predicts function-specific REs consisting of clusters of specifically-associated TFBSs, and it also scores the association of individual TFs with the biological function shared by the group of input genes. Its integration with the Array2BIO server allows users to start their analysis with raw microarray expression data.

DiRE server: http://dire.dcode.org

Transposable elements: from genomes to proteomes

It is now well established that transposable elements (TEs) represent almost half of the human genome as well as significant fractions of other mammalian genomes. Although TEs were initially regarded as one of the main classes of "junk DNA", it has gradually become clear that they play important functional roles, such as regulating gene expression and facilitating recombination.

PTPN1

Transposable elements are also thought to contribute to protein-coding sequences because many transcripts have been found to contain TE fragments in their coding sequence. However, experimental confirmation is lacking for most of these cases, so an investigation of well-characterized proteins was necessary. I attempted this with the collection of human proteins from the Protein Data Bank (PDB), which, despite being a small and biased dataset, served the purpose well. Although there were many candidates, careful phylogenetic analyses revealed that very few proteins are likely to contain TE-encoded fragments. Interestingly, all of these proteins arose after a series of duplication events and contain fragments of very old TEs. For example, a fragment of the protein tyrosine phosphatase non-receptor type 1 (PTPN1) is encoded by an L3-like element (highlighted in brown and yellow in the rotating 3D structure). For detailed results and discussion, please see Gotea and Makalowski (2006) Do transposable elements really contribute to proteomes? Trends in Genetics 22(5): 260-267.

Alu elements are a special class of transposable elements because of their high incidence not only in the human genome (more than 1 million copies) but also in alternatively spliced transcripts. The contribution of Alu elements to proteomes has been addressed by a few recent studies but remains controversial. To clarify this, I am currently evaluating the potential of Alu elements to contribute to functional proteins using various computational methods, such as evolutionary analysis and protein homology modeling. For the latter I collaborate with Dr. Jordi Bella from the University of Manchester, under whose supervision I completed a six-week training visit in 2003 thanks to the Penn State WUN program.

The ScrapYard Database

Transposable elements can impact the functionality of genomes both as components of the genomes themselves (e.g. they provide transcription regulating signals) and, post-transcriptionally, as components of mature transcripts. To facilitate the study of the latter, we built the ScrapYard Database (SYDB), which provides easier access to transposable element details in vertebrate transcriptomes. Its current interface is rather simple and its content limited to three species (human, mouse, rat), but we are working to incorporate information for thirteen vertebrate species and to enhance its searching capabilities. Updates will be available at the SYDB web site.

Insect Spliceosomal snRNAs

Our group has engaged in annotating and characterizing the spliceosomal small nuclear RNA (snRNA) genes from the recently sequenced honey bee genome, as well as from ten other insect genomes. For me, this was only a side project in the beginning, but as the computational demand increased, I gradually became more involved with various aspects of the analysis. I was especially interested in the conservation and evolution of the gene promoters (the PSEA signal, to be more precise), as well as in the evolution of the snRNA genes themselves, which yielded some interesting observations. For example, we found that Dipteran species contain a divergent U11 snRNA gene, which might explain the relatively low number of U12-type introns found in these species. You can find more details about this analysis in our recently published RNA paper (Mount et al. 2007). We also contributed to the community effort of annotating the genome sequences of 12 Drosophila species by expanding our annotation of the spliceosomal snRNA genes to all 12 available Drosophila genomes.

Watch your "words"!

My work in the Makalowski lab started with a hunt for duplicated segments in a few vertebrate genomes (human, mouse, rat, and Japanese pufferfish). This yielded different duplication profiles for these species, which provide clues about the timing of different duplication waves that marked the genomic history of these species. A poster with our findings was presented at the 50th DNA anniversary meeting at Cold Spring Harbor Laboratory [Abstract].

To search for duplicated segments, we relied on NCBI's popular MegaBLAST software. Unfortunately, not all parameters were well documented, and we learned the hard way that the default word size ("-W" parameter) of 28 was not appropriate for finding duplicated segments of at least 1 kb in length and 80% similarity. You can try to work out why for yourself, using the graph of the word hit probability density function:

Variation of probability density function of seed/word hit probability

From this graph it should be clear that a large word will miss many potential matches, so a word should be small enough to ensure that its hit probability is 1. However, a small word size increases both the memory usage and the running time of MegaBLAST, so the ideal word size is the largest one that still maintains a hit probability of 1 (or very close to 1). Knowing the optimal word size is especially important for genome-scale searches, where both query and target sequences can be as big as ~250MB.
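The trade-off described above can be explored numerically. The snippet below is only a minimal sketch, not the analysis from the published paper: it assumes an ungapped alignment in which every column matches independently with the same probability, and it computes the chance that such an alignment contains at least one run of consecutive matches as long as the word. The function name `word_hit_probability` and its parameters are hypothetical, chosen for illustration.

```python
def word_hit_probability(length, identity, word_size):
    """Probability that an ungapped alignment of `length` columns contains
    at least one run of `word_size` consecutive matches.

    Assumption (for illustration only): every column is a match with
    probability `identity`, independently of all other columns.
    """
    if word_size > length:
        return 0.0
    # no_hit[r] = P(no full word seen so far AND the current run of matches has length r)
    no_hit = [0.0] * word_size
    no_hit[0] = 1.0
    for _ in range(length):
        nxt = [0.0] * word_size
        # A mismatch resets the current run to zero from any state.
        nxt[0] = sum(no_hit) * (1.0 - identity)
        # A match extends the run by one; a run that would reach word_size
        # is a hit and is dropped from the "no hit" probability mass.
        for run in range(word_size - 1):
            nxt[run + 1] = no_hit[run] * identity
        no_hit = nxt
    return 1.0 - sum(no_hit)


if __name__ == "__main__":
    # Segment of 1 kb at 80% similarity, as in the duplication search above.
    for w in (28, 20, 16, 12):
        print(w, round(word_hit_probability(1000, 0.80, w), 4))
```

Under these simplifying assumptions, a word of 28 has a hit probability well below 1 for a 1 kb segment at 80% similarity, while much shorter words come very close to 1. Real alignments contain gaps and unevenly distributed mismatches, so the numbers should be read as a rough guide only.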

We definitely had to watch our "words"! In a joint effort with Vamsi Veeramachaneni, we analyzed in detail the influence of the word/seed size on the quality of BLAST searches and we developed a set of recommendations for the appropriate use of the word parameter in one of the most popular bioinformatics tools: BLAST (Basic Local Alignment Search Tool). MegaBLAST is capable of using discontiguous words/seeds as well, but because the implemented patterns are suboptimal (their hit probability was determined by simulations), we determined the optimal ones by computing their exact hit probability. A detailed analysis was published in Nucleic Acids Research. We were happy to learn recently that NCBI is currently working on implementing our recommendations into their popular BLAST package [relevant link].
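To illustrate what computing an exact hit probability (rather than estimating it by simulation) can look like for a discontiguous seed, here is a minimal sketch. It is not the code or the exact method used for the paper. It assumes independent alignment columns with a fixed match probability, and it is practical only for short patterns because the number of tracked states grows as 2^(span-1). The function name `seed_hit_probability` and the pattern notation ("1" = position must match, "0" = position is ignored) are assumptions made for this example.

```python
def seed_hit_probability(pattern, length, identity):
    """Exact probability that a (possibly discontiguous) seed hits somewhere
    in an ungapped alignment of `length` columns.

    pattern  : string of '1' (must match) and '0' (ignored); hypothetical notation
    identity : per-column match probability (columns assumed independent)
    """
    span = len(pattern)
    if span == 0 or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 0s and 1s")
    if span > length:
        return 0.0
    mask = int(pattern, 2)           # oldest column is the highest bit
    keep = (1 << (span - 1)) - 1     # remember only the last span-1 columns
    no_hit = {0: 1.0}                # state -> probability, given no hit so far
    for column in range(length):
        nxt = {}
        for state, prob in no_hit.items():
            for bit, weight in ((1, identity), (0, 1.0 - identity)):
                window = (state << 1) | bit
                # A hit is possible only once a full window has been read.
                if column >= span - 1 and (window & mask) == mask:
                    continue         # hit: remove this mass from "no hit"
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + prob * weight
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


if __name__ == "__main__":
    # Compare a contiguous word with a discontiguous seed of the same weight.
    print(seed_hit_probability("1111", 64, 0.7))
    print(seed_hit_probability("110101", 64, 0.7))
```

Because the calculation enumerates states instead of sampling alignments, the result carries no sampling noise, which is what makes it possible to rank candidate patterns reliably.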

Tang and Lewontin test

In 1999, Hua Tang and Richard Lewontin proposed a nice test to detect regions of increased or decreased variability (e.g. mutational "hot" and "cold" spots) and test their statistical significance. While detecting the location of such regions is relatively easily accomplished by calculating G(xk), testing their statistical significance involves constructing the distribution of the T statistic under the null hypothesis of even distribution of the events (e.g. SNPs) across the sequence (see reference).

Even though Tang and Lewontin's Table 1 contains several critical T values, I thought that it would be nice to have a tool for creating null distributions for any combination of n and N, which are unlikely to be round numbers in real-life research. With a bit of creative fun, I wrote a small C program that does just that: it creates the null distribution of the T statistic for a specified combination of events (n) and sequence length (N). Feel free to download the source code below and use it in your own research (make sure to read the comments in the main file, called tlw.c).
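Independently of tlw.c, whose internals are not described on this page, the general idea of building a null distribution can be sketched with a short Monte Carlo simulation in Python. Everything named here is hypothetical: `null_distribution`, `critical_value`, and in particular `deviation_range`, which is only a stand-in statistic (the range of the gap between observed and expected cumulative event fractions) and has not been checked against the definition of T in the paper. Substitute the exact statistic from Tang and Lewontin (1999) before using this for real work. The sketch also assumes that events fall on distinct sites and produces unsmoothed values.

```python
import random


def deviation_range(positions, seq_len):
    """Stand-in statistic (ASSUMPTION, not verified against the paper):
    range of (k/n - x_k/N) over the sorted event positions x_1 < ... < x_n."""
    n = len(positions)
    devs = [k / n - x / seq_len
            for k, x in enumerate(sorted(positions), start=1)]
    return max(devs) - min(devs)


def null_distribution(n_events, seq_len, statistic=deviation_range,
                      replicates=10000, seed=0):
    """Simulate the statistic under the null hypothesis that the n events are
    spread evenly (uniformly at random, on distinct sites) along N sites."""
    rng = random.Random(seed)
    sites = range(1, seq_len + 1)
    return sorted(statistic(rng.sample(sites, n_events), seq_len)
                  for _ in range(replicates))


def critical_value(null_values, alpha=0.05):
    """Upper-tail critical value read from the sorted simulated values."""
    index = min(len(null_values) - 1, int((1.0 - alpha) * len(null_values)))
    return null_values[index]


if __name__ == "__main__":
    # Arbitrary, non-round example values for n and N.
    null = null_distribution(n_events=37, seq_len=1213, replicates=5000)
    print("5% critical value:", critical_value(null, 0.05))
```

An observed value larger than the simulated critical value would then be judged significant at the chosen level, under the assumptions stated above.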

Unsmoothed distribution of T under the null

Note that a smoothing parameter is required to obtain a symmetrical distribution, as in their Fig. 2, where a "somewhat arbitrarily" chosen smoothing parameter of 0.005 was used (more likely found by trial and error). Without the smoothing parameter, the distribution looks like the figure above (note the double resolution of the graph as compared to Fig. 2). It might take a few trials to find a suitable value, but it should not be too hard.

The source code should compile on any computer running *nix (Mac OS X included, as gcc is likely to be there) or MS-DOS/Windows (a C compiler, such as djgpp, is needed). Should you run into trouble running it, or should you have any questions or comments, please feel free to email me.

Download source code: [tlw.zip] [md5]

Reference: Tang, H., and R.C. Lewontin (1999) Locating regions of differential variability in DNA and protein sequences. Genetics 153: 485-495 [Full text]



Last updated: April 2008