SSLpheno: a self-supervised learning approach for gene-phenotype association prediction using protein-protein interactions and gene ontology data

Bioinformatics. 2023 Nov 1;39(11):btad662. doi: 10.1093/bioinformatics/btad662.

Abstract

Motivation: Medical genomics faces significant challenges in interpreting disease phenotypes and genetic heterogeneity. Despite the establishment of standardized disease phenotype databases, computational methods for predicting gene-phenotype associations still suffer from imbalanced category distributions and a lack of labeled data in sparsely annotated categories.

Results: To address the scarcity of labeled data, we propose a self-supervised learning strategy for gene-phenotype association prediction, called SSLpheno. Our approach uses an attributed network that integrates protein-protein interactions and gene ontology data. We apply a Laplacian-based filter to ensure feature smoothness and use self-supervised training to optimize the node feature representations. Specifically, we calculate the cosine similarity of the feature vectors and select positive and negative sample nodes, which serve as training labels for reconstruction. In the downstream task, we employ a deep neural network for multi-label classification of phenotypes. Our experimental results demonstrate that SSLpheno outperforms state-of-the-art methods, especially in categories with fewer annotations. Moreover, our case studies illustrate the potential of SSLpheno as an effective prescreening tool for gene-phenotype association identification.
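The smoothing and sample-selection steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository under Availability for that). The filter form (I - 0.5 L, with L the symmetric normalized Laplacian), the number of filter rounds, the pair counts, the toy graph, and all function names are assumptions made purely for illustration.

```python
import numpy as np

def laplacian_smooth(adj, feats, k=1):
    """Apply k rounds of the low-pass filter (I - 0.5 * L_sym) to node features.

    The filter form and k are illustrative assumptions, not the paper's settings.
    """
    n = adj.shape[0]
    a_hat = adj + np.eye(n)                       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    lap = np.eye(n) - a_norm                      # symmetric normalized Laplacian
    filt = np.eye(n) - 0.5 * lap
    out = feats.astype(float)
    for _ in range(k):
        out = filt @ out
    return out

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def select_pairs(sim, n_pos, n_neg):
    """Return the n_pos most similar and n_neg least similar node pairs (i < j)."""
    iu, ju = np.triu_indices(sim.shape[0], k=1)
    order = np.argsort(sim[iu, ju])               # ascending similarity
    neg = [(int(iu[t]), int(ju[t])) for t in order[:n_neg]]
    pos = [(int(iu[t]), int(ju[t])) for t in order[::-1][:n_pos]]
    return pos, neg

# Toy attributed network: 4 genes, edges 0-1 and 2-3 (stand-ins for PPI links),
# with 2-dimensional features (stand-ins for gene ontology attributes).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = np.array([[1.0, 0.0],
                  [0.8, 0.2],
                  [0.0, 1.0],
                  [0.2, 0.8]])

smoothed = laplacian_smooth(adj, feats, k=1)
sim = cosine_similarity_matrix(smoothed)
pos_pairs, neg_pairs = select_pairs(sim, n_pos=2, n_neg=1)
print("positive pairs:", pos_pairs)
print("negative pairs:", neg_pairs)
```

In this toy setting the filter pulls the features of connected genes toward each other, so the most similar pairs coincide with the linked genes and the least similar pair spans the two components. These pseudo-labels require no phenotype annotations, which is what makes the training self-supervised.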

Availability and implementation: https://github.com/bixuehua/SSLpheno.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Gene Ontology
  • Genomics*
  • Neural Networks, Computer*
  • Phenotype
  • Supervised Machine Learning