Yi-Kuo Yu

Senior Investigator
National Center for Biotechnology Information (NCBI)
National Library of Medicine (NLM)
National Institutes of Health (NIH)
Bldg. 38A, Room 6S610
9000 Rockville Pike, MSC 3829
Bethesda, MD 20894, USA
Tel: (301) 435-5989
Fax: (301) 480-2290, (301) 480-2288
e-mail: yyu <at> ncbi.nlm.nih.gov



Principal Research Interests

Our group investigates biological problems at multiple levels of detail in order to gain a quantitative understanding of biology. At the microscopic level, we aim to build a solid foundation for the quantitative understanding of biomolecular interactions. At a more coarse-grained level, we develop and employ computational approaches with sound statistical foundations to enhance the separation of information from noise in massive biological data sets, thereby paving the way for the discovery of higher organizational principles in biology and for placing constraints on them. A major goal of our group is to foster a solid connection between medical research and fundamental scientific research.

Molecular Interactions (MI):

Our studies at the microscopic level have concentrated on electrostatics, the most important component of biomolecular interactions. These studies aim to provide an accurate description of electrostatic interactions among biomolecules. This effort has resulted in a new formulation of electrostatics in the presence of complicated dielectric media. The formulation permits, for the first time, a controllable approximation for the calculation of electrostatic energy and forces [1-3]. Consequently, one can readily estimate the magnitude of the error in each computed quantity and improve the accuracy as far as desired by incorporating more of the prescribed correction terms in the computation. We are also investigating the quantum mechanical effects that govern molecular binding and interactions.

Molecular/Information Networks (MN):

The advent of the genomic era has enabled rapid accumulation of information, including DNA/protein sequence data, protein/RNA structural data, and biomolecular interaction data. These valuable, and often redundant, data allow researchers to mine relevant information at various organizational levels, ranging from determining active sites in protein domains to uncovering relations among functional pathways and even whole-cell organization. However, different combinations of these data can also serve as the basis for conflicting claims. To avoid errors introduced through additional annotations, we have developed a method, called information flow, to detect the information transduction modules responsible for propagating information from one node in the network to another. When applied to the protein-protein interaction network, this method illuminates the nodes involved in the relevant biological pathways connecting the two specified nodes [4]. This framework is also applicable to information filtering in any community network, such as recommendation systems [5-7]. We are currently developing additional methods to extract meaningful information from generic interaction networks.

Mass Spectrometry, Statistics, and Proteomics (MS):

At a more macroscopic level, we are interested in several topics where robust statistical analyses have proven valuable. In the realm of sequence alignment, we have worked on improving statistical accuracy and retrieval efficiency through a series of developments [8-16]. In bioinformatics and proteomics, we have invested substantial effort in developing tools with robust statistical foundations for extracting biologically relevant information. For example, we have developed computational tools for peptide/protein identification from tandem mass spectrometry (MS/MS) data [17-20] and methods to improve statistical significance assignment [cite] in this area. We have also integrated existing knowledge, such as protein modifications and their accompanying disease associations, into our peptide searches. Our goal in this general direction is to enhance the separation of information from noise in massive biological data sets, thereby putting constraints on higher organizational principles in biology that are yet to be discovered.



1. Yu YK (2003) On a class of integrals of Legendre polynomials with complicated arguments--with applications in electrostatics and biomolecular modeling. Physica A, 326: 522-33.
PMID: 15759366

2. Doerr TP, Yu YK (2004) Electrostatics in the presence of dielectrics: The benefits of treating the induced surface charge density directly. American Journal of Physics, 72: 190-6.
DOI: 10.1119/1.1624115

3. Doerr TP, Yu YK (2006) Electrostatics of charged dielectric spheres with application to biological systems. Phys Rev E, 73: 061902.
DOI: 10.1103/PhysRevE.73.061902

4. Stojmirovic A, Yu YK (2007) Information flow in interaction networks. J Comput Biol, 14: 1115-43.
PMID: 17985991

5. Zhang YC, Blattner M, Yu YK (2007) Heat conduction process on community networks as a recommendation model. Phys Rev Lett, 99: 154301.
PMID: 17995171

6. Yu YK, Zhang YC, Laureti P, Moret L (2006) Decoding information from noisy, redundant, and intentionally distorted sources. Physica A, 371: 732-44.
DOI: 10.1016/j.physa.2006.04.057

7. Laureti P, Moret L, Zhang YC, Yu YK (2006) Information filtering via Iterative Refinement. Europhys Lett, 75: 1006-12.
DOI: 10.1209/epl/i2006-10204-8

8. Yu YK, Hwa T (2001) Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J Comput Biol, 8: 249-82.
PMID: 11535176

9. Yu YK, Bundschuh R, Hwa T (2002) Hybrid alignment: high-performance with universal statistics. Bioinformatics, 18: 864-72.
PMID: 12075022

10. Yu YK, Bundschuh R, Hwa T (2002) Statistical significance and extremal ensemble of gapped local hybrid alignment. Lecture Notes in Physics, Springer Berlin/Heidelberg, 585: 3-21.
DOI: 10.1007/3-540-45692-9_1

11. Kschischo M, Lässig M, Yu YK (2005) Toward an accurate statistics of gapped alignments. Bull Math Biol, 67: 169-91.
PMID: 15691544

12. Yu YK, Altschul SF (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics, 21: 902-11.
PMID: 15509610

13. Sardiu ME, Alves G, Yu YK (2005) Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem. Phys Rev E Stat Nonlin Soft Matter Phys, 72: 061917.
PMID: 16485984

14. Altschul SF, Wootton JC, Gertz EM, Agarwala R, Morgulis A, Schaffer AA, Yu YK (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J, 272: 5101-9.
PMID: 16218944

15. Yu YK, Gertz EM, Agarwala R, Schaffer AA, Altschul SF (2006) Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res, 34: 5966-73.
PMID: 17068079

16. Gertz EM, Yu YK, Agarwala R, Schaffer AA, Altschul SF (2006) Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol, 4: 41.
PMID: 17156431

17. Alves G, Ogurtsov AY, Wu WW, Wang G, Shen RF, Yu YK (2007) Calibrating E-values for MS2 database search methods. Biol Direct, 2: 26.
PMID: 17983478

18. Alves G, Ogurtsov AY, Yu YK (2007) RAId_DbS: peptide identification using database searches with realistic statistics. Biol Direct, 2: 25.
PMID: 17961253

19. Doerr TP, Alves G, Yu YK (2005) Ranked solutions to a class of combinatorial optimizations - with applications in mass spectrometry based peptide sequencing and a variant of directed paths in random media. Physica A, 354: 558-70.

20. Alves G, Yu YK (2005) Robust accurate identification of peptides (RAId): deciphering MS2 data using a structured library search with de novo based statistics. Bioinformatics, 21: 3726-32.
PMID: 16105903