| Date | Event |
|---|---|
| Aug 1, 2024 | Call for talks deadline |
| Sept 9, 2024 | Notification of acceptance |
| Dec 2, 2024 | Poster abstract submission deadline |
| Jan 4-8, 2025 | Conference dates |
LLMs and biomedical annotations have a symbiotic relationship. LLMs rely on high-quality annotations for training and improvement, while they can also automate parts of the annotation process and improve its quality.
High-quality, well-annotated biomedical data is crucial for training LLMs to understand and process scientific information. These annotations can include labeling entities (genes, proteins), relations (interactions), and other relevant information. By incorporating annotated data, LLMs can learn specific domain knowledge and improve their accuracy in tasks like information extraction, knowledge base creation, and text summarization. Diverse and unbiased annotations can help mitigate bias in LLMs, ensuring their outputs are fair and representative of the underlying data.
LLMs can be used to automate some aspects of annotation, such as identifying potential entities or suggesting relevant relations. This can significantly reduce the workload for human annotators. LLMs can identify areas of uncertainty in the data and suggest which annotations would be most valuable for improving their performance. This creates a feedback loop where LLMs guide the annotation process for optimal results. Finally, LLMs can be used to check the consistency and accuracy of annotations, identifying potential errors or inconsistencies.
By addressing these challenges, this workshop aims to clarify the potential and limits of LLMs in advancing biomedical research and knowledge discovery.
Saturday, January 4th, 2025
| Time | Presenter | Title |
|---|---|---|
| 09:00-09:15 | Introduction [Slides] | |
| 09:15-09:45 | Cathy H. Wu | Text Mining and Large Language Models for Protein Functional Annotation [Slides] |
| 09:45-10:15 | Wei Wang | Information Retrieval in the Era of Large Language Models [Slides] |
| 10:15-10:30 | Coffee break | |
| 10:30-11:00 | Graciela Gonzalez | LLMs for health data annotation: Mixed results from real-life applications [Slides] |
| 11:00-11:30 | Qiao Jin | Artificial Intelligence for Evidence-based Medicine [Slides] |
| 11:30-12:00 | Panel discussion | |
- Cathy H. Wu: Text Mining and Large Language Models for Protein Functional Annotation
Abstract: To scale up annotation while minimizing issues with hallucinations when using large language models (LLMs), researchers are exploring Retrieval-Augmented Generation (RAG), in which the LLM is provided with a set of facts pertaining to a specific knowledge domain to constrain its responses. In this talk, I will present our work in harnessing the power of LLMs in conjunction with natural language processing (NLP) tools and curated knowledge in community resources and ontologies for evidence-based protein functional annotation at the UniProt Consortium. The UniProt knowledgebase associates protein entries with relevant scientific literature through a combination of in-house and community annotation and computational mapping. We are developing approaches using a generalizable RAG framework to scale up the process, including (1) prepopulating community submissions using an LLM for author verification, and (2) creating short summaries for papers using an LLM in the computationally mapped bibliography. For the first task, we employ our NLP pipeline to extract structured information from full-text open access articles and use an LLM to validate the NLP-extracted results. For the second task, we are developing a system to create rich natural language summaries of functional information using an LLM grounded by the text identified by the NLP tool. To facilitate user query and discovery of protein knowledge, we will further provide a knowledge graph (KG) and use metadata to reduce hallucinations and enhance accuracy in query generation. Our framework may be adaptable for other biomedical annotation tasks.
Short bio: With a background in both biology and computer science, Dr. Cathy Wu has conducted bioinformatics and data analytics research for over 25 years. She has led the development of the Protein Information Resource (PIR) as a major bioinformatics resource and a member of the international UniProt Consortium with 5 million pageviews per month from over 500,000 unique sites worldwide. Dr. Wu has served on many scientific advisory boards, journal editorial boards, over 60 grant review panels for NIH, NSF and DOE, and over 60 international conference organizing committees. Recognized as a “Highly Cited Researcher” (top 1%) annually since 2014, she has published over 250 peer-reviewed papers. Her research encompasses protein functional analysis, biological text mining, ontology, gene-drug-disease network modeling, bioinformatics cyberinfrastructure, and big data analytics. In 2009, Dr. Wu was appointed Unidel Edward G. Jefferson Chair in Engineering and Computer Science at the University of Delaware (UD). That same year, she established the Center for Bioinformatics and Computational Biology to foster collaborative research, where she serves as the Founding Director of the Bioinformatics Master’s, PhD, and graduate certificate programs. She has mentored more than 100 undergraduate and graduate students, post-graduate trainees, junior scientists and young investigators. The Center has developed cutting-edge bioinformatics and data analytics infrastructure, including genomic analytics capabilities for precision medicine. In 2018, she was appointed the Founding Director of the Data Science Institute, serving as a nucleating effort to catalyze and coordinate data science activities at UD, connecting researchers across seven colleges to foster multidisciplinary research collaborations in both foundations and applications of data sciences.
- Wei Wang: Information Retrieval in the Era of Large Language Models
Abstract: The emergence of large language models (LLMs) has introduced a new paradigm in data modeling. These models replace specialized models designed for individual tasks with unified models that are effective across a broad range of problems. In biomedical domains, this shift not only transforms approaches to handling natural language tasks (e.g., clinical narratives, scientific papers) but also suggests new methods for dealing with other data types (e.g., molecules, proteins, pathology images). In many fields, LLMs have already shown great potential to accelerate scientific discovery. In this talk, I will present some of the latest advances and opportunities in this space.
Short bio: Wei Wang is the Leonard Kleinrock Chair Professor in Computer Science and Computational Medicine at the University of California, Los Angeles and the director of the Scalable Analytics Institute (ScAi). She is also a member of the UCLA Jonsson Comprehensive Cancer Center, Institute for Quantitative and Computational Biology, and Bioinformatics Interdepartmental Graduate Program. She received her PhD degree in Computer Science from the University of California, Los Angeles in 1999. Dr. Wang's research interests include big data analytics, data mining, machine learning, natural language processing, bioinformatics and computational biology, and computational medicine. Dr. Wang has received numerous awards in her career and is an ACM Fellow and an IEEE Fellow. She is the chair of the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD).
- Graciela Gonzalez: LLMs for health data annotation: Mixed results from real-life applications
Abstract: This presentation explores the mixed results of using large language models (LLMs) for health data annotation across various sources. In electronic health records (EHR) notes, LLMs were applied to classify appropriate antibiotic use and insomnia phenotypes, and to extract phenotypes from observations. In literature, tasks included virus RNA segmentation detection, SARS patient metadata extraction, and toponym normalization. Social media applications involved detecting drug names, adverse drug events (ADEs), changes to medication treatment, and reasons for these changes. As time permits, we will discuss how, while LLMs offer efficiency and scalability, their performance varies across tasks due to challenges like data imbalance and annotation complexity.
Short bio: Dr. Graciela Gonzalez-Hernandez is Professor and Vice Chair for Research and Education in the Department of Computational Biomedicine at Cedars-Sinai Medical Center. She is an expert in natural language processing (NLP) and artificial intelligence, and has significantly advanced knowledge discovery in healthcare through her innovative research on extracting unstructured data from various sources. Previously, she was an Associate Professor at the University of Pennsylvania. Dr. Gonzalez-Hernandez has published over 210 scientific papers and is actively involved in mentoring and national research initiatives.
Dr. Gonzalez-Hernandez has had substantial NIH funding for her innovative research. The National Institutes of Health (NIH) renewed her R01 grant "Social Media Mining for Pharmacovigilance" for $2.4 million, covering work through 2022. She has multiple NIH-funded research projects, including R01s, Ks, and R21s as Principal Investigator and collaborator. Her NIH-funded research spans various areas, including a Data Core for the Arizona Alzheimer's Disease NIA/NIH P30 for 8 years and a more recent NIA P30 – the Penn AI Tech ADRD Pilot Core. Her research has been supported by different NIH institutes, including the National Library of Medicine (NLM) and the National Institute of Allergy and Infectious Diseases (NIAID), as well as the FDA, the CDC, and the National Board of Medical Examiners, among others.
Her NIH-funded project "Social Media Mining for Pharmacovigilance" has produced several key outcomes. The project developed the DeepADRMiner pipeline, an innovative system for extracting and normalizing adverse drug reactions from social media, enabling the analysis of patient-reported data. The research has demonstrated that trends in social media can match known drug side effects, validating the use of NLP methods for pharmacovigilance. Additionally, the project has contributed to the training and development of diverse students through research experiences at the Health Language Processing Center.
She serves as a member of the BDMA panel at NIH, and on the Board of Counselors for Intramural Research for the NLM's NCBI.
- Qiao Jin: Artificial Intelligence for Evidence-based Medicine
Abstract: Evidence-based medicine (EBM) is a clinical approach that prioritizes the integration of the best available evidence from well-designed research into decision-making for individual patient care. Despite its transformative potential, EBM faces significant barriers in both the generation and utilization of evidence. Evidence generation primarily relies on clinical trials, yet one of the major challenges to their success is patient recruitment. To address this, we introduced TrialGPT, an end-to-end framework leveraging large language models (LLMs) for zero-shot patient-to-trial matching. Similarly, LLMs also hold significant promise in facilitating the utilization of medical evidence. However, a critical limitation is their tendency for hallucination—producing plausible but factually incorrect content. To mitigate this issue, I will present our work on augmenting LLMs with domain-specific literature retrieval and database utilities. By grounding their outputs in high-quality, well-curated data, this approach substantially reduces the risk of hallucination and ensures that their generated content is based on solid medical evidence.
Short bio: Dr. Qiao Jin works on AI for evidence-based medicine at NIH with Dr. Zhiyong Lu. Prior to that, he received his MD degree from Tsinghua University in 2022. Dr. Jin’s work has been published in top-tier venues such as Nature Communications, npj Digital Medicine, NeurIPS, and ACL. He developed the MedCPT foundation models that have been downloaded over 2 million times and the PubTator database that has been accessed over 1 billion times. His work has been adopted by Google, OpenAI, Microsoft, and Anthropic, and has been featured by Nature, POLITICO, and NIH News Releases. He has received the NIH Director’s Challenge Innovation Awards, AMIA Distinguished Poster Award, IMIA Best NLP Paper Award, and BioBank Disease AI Challenge Award. Dr. Jin serves on the editorial boards of JMIR and JBI.