Highly accurate disease diagnosis and highly reproducible biomarker identification with PathFormer

Biomarker identification is critical for precise disease diagnosis and for understanding disease pathogenesis in omics data analysis. Graph neural networks (GNNs) have been the dominant deep learning models for analyzing graph-structured data. However, we found two major limitations of existing GNNs in omics data analysis: limited prediction/diagnosis accuracy and limited reproducibility of biomarker identification across multiple datasets. The root of these challenges is the unique graph structure of biological signaling pathways, which consists of a large number of targets and intensive, complex signaling interactions among these targets. To resolve these two challenges, we present in this study a novel GNN architecture, named PathFormer, which systematically integrates signaling networks, prior knowledge and omics data to rank biomarkers and predict disease diagnosis. In our comparisons, PathFormer significantly outperformed existing GNN models in terms of prediction accuracy (about 30% accuracy improvement in disease diagnosis compared with existing GNN models) and reproducibility of biomarker ranking across different datasets. The improvement was confirmed using two independent Alzheimer's disease (AD) datasets and cancer transcriptomic datasets. The PathFormer model can be directly applied to other omics data analysis studies.


Introduction
Due to the advent of next-generation sequencing (NGS) and high-throughput technologies, large-scale and personalized omics data have been generated. The analysis of these omics datasets has uncovered many novel disease-associated targets. However, for most diseases, the complex disease pathogenesis remains unclear. In omics data analysis, biomarker or target identification is critical for precise disease diagnosis and for understanding disease pathogenesis. For example, fold change plus p-value via statistical analysis, or regression analysis, are widely used models to rank targets, followed by functional analysis of the top-ranked targets. However, these approaches cannot model the signaling interactions among individual targets/proteins. On the other hand, signaling networks, like signaling pathways and protein-protein interactions (PPIs), which formulate the genome-wide associations/interactions of multiple genes, are ubiquitous in various bioinformatics applications, including drug synergy prediction Podolsky and Greene [2011], Hopkins [2008], DVH prediction Dong et al. [2024], Alzheimer's disease (AD) detection Song et al. [2019], Qin et al. [2022], and cancer classification Lu and Han [2003], Viale [2012], Amrane et al. [2018]. Network-based analysis can identify stable network-module biomarkers or hub genes that help explain the contributions of pathway gene sets to disease phenotypes.
Signaling networks are essentially one type of graph modality. Nowadays, deep learning models are in great demand to analyze signaling networks computationally. Various deep models Yang et al. [2014], Horvath and Dong [2008], Song and Zhang [2015] have been proposed to predict disease phenotypes from gene expressions, yet they did not consider the interactions between genes, thus failing to capture the joint role of multiple genes in determining phenotypic variability. Graph neural networks (GNNs) Gilmer et al. [2017], Kipf and Welling [2016], Scarselli et al. [2008], Velickovic et al. [2018], Dong et al. [2022] are the dominant architecture for modeling graph-structured data and have achieved impressive performance on analysis tasks over various graphs, including social networks, molecules and circuits Berg et al. [2017], Bian et al. [2020], Dong et al. [2023a]. Though GNNs simultaneously encode gene expression profiles and genetic interactions, several drawbacks limit their potential in real-world bioinformatics applications. (1) First, current GNNs often exhibit subpar performance in predicting the disease phenotype. For instance, in the Alzheimer's disease (AD) classification task, where GNNs are required to distinguish AD samples from controls, we observe that the classification accuracies of existing/dominant GNNs are close to 0.6, only slightly better than random guessing (limited diagnosis accuracy). (2) Second, current GNNs fail to provide interpretable results of biological meaning. A common interpretation pipeline is to extract a gene subset from the input gene network; gene set enrichment analysis (GSEA) can then help researchers identify key biological pathways and processes. Though GNN architectures such as SortPool and GAT provide ways to rank the nodes' contributions to select the gene subset, these techniques are neither robust nor disease-specific (limited-reproducibility target ranking).
The unique graph structure of biological signaling pathways, which consists of a large number of targets and intensive, complex signaling interactions among these targets, is believed to be the root of these challenges. As seen in Figure 3-a,b, we mathematically characterized the limitations of existing GNNs in gene network representation learning. Specifically, compared to graphs in popular benchmark graph datasets, gene networks usually contain thousands of genes/nodes, many of which have extremely large node centrality. Such properties cause the subpar prediction performance of dominant GNNs in two ways: (1) dominant GNNs suffer from the over-squashing problem Xu et al. [2018], Alon and Yahav [2020] on graphs with large average node degree/centrality; (2) dominant expressive GNNs, such as subgraph-based GNNs and high-order GNNs Morris et al. [2019], Grohe [2021], have space/time complexity issues when applied to large-scale graphs like gene networks. Thus, novel GNN models are needed for signaling network-based omics data analysis for disease diagnosis and biomarker detection.
The objectives of this paper are to develop a powerful GNN for precise prediction and robust gene subset detection in gene-network-based bioinformatics tasks (see Figure 1). Herein, we propose a novel graph convolution architecture called the PathFormer encoder layer; Figure 2-a illustrates the architecture. The PathFormer encoder layer is built upon the Transformer architecture Vaswani et al. [2017] to aggregate information through the self-attention mechanism, which has proven to be an effective solution to the over-squashing problem. We then use the universal orders of genes as positional encodings in the Transformer architecture, which has been proven in Dong et al. [2023c] to be an ideal solution for maximizing expressive power and improving prediction performance without introducing any complexity issues. Finally, the original Transformer model is designed for sequence data rather than graphs, thus ignoring pathway information in graphs. Hence, the proposed PathFormer encoder layer injects the gene-pathway information as a learnable attention bias matrix into the attention mechanism, providing the flexibility to capture the pathway relation between any pair of genes in a gene network.
Besides the expressivity needed to generate accurate predictions, interpretability (transparency) is also critical for deep learning models in practical bioinformatics. Thus, interpretation techniques are desirable for discovering biologically meaningful pathways and processes associated with a particular disease. In this paper, we aim to detect the gene subset that can be used to decipher disease-specific biological knowledge, and we develop a Knowledge-guided Disease-specific Sortpool layer (KD-Sortpool) to achieve this goal. KD-Sortpool resorts to the sort-and-pool strategy Zhang et al. [2018], Lee et al. [2019], which sorts genes in the input gene network according to some metric value V and then keeps the top K genes.

Figure 1: Overview of the proposed framework for gene signaling network analysis with GNNs. Gene interactions and gene expressions are obtained from omics data to formulate gene networks/graphs. Then GNNs are used to perform the prediction task accurately and efficiently, while detecting a robust disease-specific gene subset to understand the relation between hub genes and disease phenotypes.

To incorporate disease-specific information and biological knowledge, we use a quantitative measurement, such as the GDA score from DisGeNET, to characterize the gene-disease association. KD-Sortpool computes a distribution of the gene selection by normalizing the metric value V across genes; this distribution is used to compute the expectation of the gene-disease association measurement, which serves as a regularization term in the objective function of the optimization problem. We tested our PathFormer model on two challenging real-world bioinformatics tasks: Alzheimer's disease (AD) classification and cancer classification. Two AD datasets (Mayo and Rosmap) and one cancer dataset are used. Experimental results demonstrate that our PathFormer model outperforms existing AI models on all datasets. The average improvement of prediction accuracy is at least 38% on the AD datasets and 23% on the cancer dataset. Furthermore, we show that a disease-specific stable set of genes is identified by the PathFormer model. Based on the prior knowledge of a particular disease, the KD-Sortpool layer selects the top K genes as the gene subset for the purpose of outcome interpretation. Then PathFormer encoder layers iteratively update the features of each gene by aggregating its neighbors' gene features. In the end, PathFormer summarizes all gene features through an MLP (multi-layer perceptron) to generate a vector embedding of the input gene network, and this embedding is used to predict the disease or a particular phenotype.

Knowledge-guided Disease-specific Sortpool
The sort-and-pool strategy is widely adopted to select "important nodes" in graphs. Existing GNN models, such as Sortpool Zhang et al. [2018] and SAGpool Lee et al. [2019], sort nodes according to the learnt node representations/vectors after multiple graph convolution operations. Hence, the sorting operation is individualized; that is, graphs with different node features and topology may select completely different nodes of interest. In omics data analysis studies, biomarker ranking is different: it is usually based on a group of samples, like AD or cancer subtypes vs. control. To achieve this, we design the Knowledge-guided Disease-specific Sortpool (KD-Sortpool).
Unlike general graphs in popular graph deep learning tasks, gene signaling networks have some unique properties that can be utilized to design powerful neural architectures. Overall, we find that each gene (name) appears at most once in each signaling network, and the connection of any pair of genes is shared among different signaling networks. Let G = {G_n = (V_n, E_n) | n = 1, 2, ..., N} represent the group of signaling networks for all patients (AD or control samples in the AD datasets) to study, where V_n contains the genes of sample n and E_n contains the reported connections of these genes. Then there is a gene set S = ⋃_{G_n ∈ G} V_n that contains all possible genes to analyze in the group G.
We can then define an order function h on the overall gene set S, such as a lexicographical sort of gene names, to provide a unique way to order the genes in S. After that, we define a learnable vector λ of size |S| to formulate the distribution of gene selection. In this vector λ, each element λ_p assigns gene p ∈ S a trainable importance score. Consequently, for each sample/patient-specific signaling network G_n = (V_n, E_n) ∈ G, the probability ε(v) of selecting a node/gene v is computed by normalizing the importance scores in λ. This construction makes the gene selection process independent of the initial gene features (i.e., gene expressions). The reason is that group-based analysis of the expression of the same gene in different sample groups (like the AD sample group vs. the control sample group) sometimes provides contradictory results in different datasets. Appendix F takes the AD classification as an example: we compare the difference of the mean gene expression between AD samples and control samples for each gene in the signaling network. For most genes, the t value is significantly larger than 0 on dataset MAYO while it is significantly smaller than 0 on dataset ROSMAP, and vice versa. This observation indicates that if a disease-specific group-based gene ranking function depends on gene expression values, it tends to detect different patterns of important genes in MAYO and ROSMAP, which leads to contradictory conclusions since both datasets are used to study Alzheimer's disease.
To incorporate disease-specific information in the gene selection process without using gene expressions, KD-Sortpool quantifies the gene-disease association of genes in the set S through prior biological knowledge. In this study, we use the open-source database DisGeNET Piñero et al. [2021], Kanehisa and Goto [2000] to obtain a quantitative estimation of each gene-disease association from the text profiles of publications supporting that association. Multiple scores, such as the GDA score and VDA score, are available from DisGeNET for this purpose. KD-Sortpool takes the GDA score of a gene v as the quantitative estimation e(v) of its gene-disease association.
Equipped with the gene selection distribution ε(v) and the gene-disease association estimation e(v), KD-Sortpool deterministically sorts the probabilities {ε(v) | v ∈ V_n} of all genes in the input gene network G_n = (V_n, E_n) of a patient, then keeps the top K genes as the gene subset of interest. Let S_n ⊆ V_n be this gene subset. Since genes are selected independently, KD-Sortpool ignores joint associations between multiple genes and the disease. The expectation of the association between the selected gene subset S_n and the disease can then be estimated from ε(v) and e(v).
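The explicit formulas are elided in this text; a plausible reconstruction, assuming softmax normalization of the trainable scores and a factorized expectation (our reading, not necessarily the exact original equations), is:

```latex
% Assumed form: selection probability as a softmax over trainable scores
\epsilon(v) \;=\; \frac{\exp(\lambda_{h(v)})}{\sum_{u \in S} \exp(\lambda_{h(u)})},
\qquad
% expected gene--disease association of the selected subset S_n
A(S_n) \;\approx\; \sum_{v \in S_n} \epsilon(v)\, e(v).
```

Note that this form is consistent with the later remark that A(S_n) is upper bounded by the sum of ε(v) over S_n, since each GDA score e(v) lies in [0, 1].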

PathFormer Encoder Layer
Since KD-Sortpool ignores the co-effect of genes' functionality, the subsequent graph convolution layers (i.e., PathFormer encoder layers) are expected to produce an output/prediction that focuses on different genes in the input signaling network. As a fundamental component in many deep neural architectures, the attention mechanism helps identify the most important items in the input and has achieved state-of-the-art predictive performance in various deep learning tasks like natural language processing. In addition, the attention mechanism has been proven to be an effective solution to the over-squashing problem Alon and Yahav [2020]. Consequently, we resort to the attention mechanism when designing the PathFormer encoder layer to improve diagnosis accuracy.
In the context of deep learning, especially in models like the Transformer, the attention mechanism assigns different weights to different items of the input by computing a key vector, query vector and value vector for each item based on its features. Query vectors and key vectors are used to compute similarity scores through a similarity function like the dot product; a softmax function then transforms these similarity scores into attention weights, which are used to calculate a weighted sum of the corresponding value vectors to generate the outputs of the attention mechanism. Thus, when the attention mechanism is applied to graphs like signaling networks, the key and query of a node (gene) in the input graph (signaling network) depend only on its features (expression profile).
Consequently, the weight the attention mechanism assigns to a pair of nodes (genes) is always the same regardless of how these two nodes (genes) are connected in the input graph (signaling network). In bioinformatics and systems biology, genes are studied in biological pathways to explain their relation to a specific phenotype or disease. Hence, we propose a Pathway-enhanced Attention Mechanism (PAM) in the PathFormer encoder layer to incorporate the pathway information of genes in the computation of their connection strength. Enumerating all pathways between each gene pair in a signaling network can be computationally complex and impractical. Instead, our PAM introduces a Signaling Network Pathway Modeling Framework (SNPMF) that provides a vector for each gene in a signaling network which injectively represents the bounded-size pathways containing that gene.
One interesting property of signaling networks is that the connection/edge of any pair of genes is shared among different signaling networks. Recall that G represents the group of signaling networks to study, and S = ⋃_{G_n ∈ G} V_n contains all possible genes in G. Thus, for any two genes u, v ∈ S, if they are connected in one signaling network (i.e., ∃i s.t. (u, v) ∈ E_i), then they are also connected in any other signaling network that contains both genes (i.e., for ∀j, we have (u, v) ∈ E_j if u, v ∈ V_j). This property indicates that encoding multiple pathways is equivalent to encoding the genes on these pathways regardless of the gene connections. Our objective is therefore to find a way to encode genes that can be shared among signaling networks. The sorting function h on the overall gene set S provides an ideal solution, as it generates the same feature for the same gene across different signaling networks. Based on the above analysis, we propose SNPMF; Figure 2-a-a1 illustrates the framework. SNPMF generates a vector p(v) of size B × |S| to represent pathways around each gene v in the signaling network G_n. Here B is a hyper-parameter that bounds the size of pathways. p(v) is initialized as a zero vector. Then, for any gene u in signaling network G_n that lies on a pathway containing gene v with distance d < B to gene v, SNPMF sets the element (d − 1) × |S| + h(u) of the vector p(v) to 1.
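A minimal sketch of how the SNPMF vector p(v) could be computed; the BFS-based distance computation and all function/variable names here are our own reconstruction of the description above, not the authors' implementation:

```python
from collections import deque

def pathway_vector(adj, v, h, B):
    """Sketch of SNPMF (our reconstruction): mark genes at distance
    d < B from gene v in a vector of size B * |S|.

    adj: dict gene -> iterable of neighboring genes (edges are shared
         across signaling networks, so one adjacency map suffices)
    v:   the gene of interest
    h:   dict gene -> its fixed order in the universal gene set S
    B:   hyper-parameter bounding the pathway size
    """
    n = len(h)                      # |S|
    p = [0] * (B * n)
    dist = {v: 0}
    queue = deque([v])
    while queue:
        cur = queue.popleft()
        d = dist[cur]
        if d >= B - 1:              # neighbors would be at distance >= B
            continue
        for u in adj.get(cur, ()):
            if u not in dist:
                dist[u] = d + 1     # 1 <= d + 1 < B
                # mark element (d' - 1) * |S| + h(u) for distance d' = d + 1
                p[(dist[u] - 1) * n + h[u]] = 1
                queue.append(u)
    return p
```

For a toy path A-B-C with B = 2 and v = A, only gene B (distance 1) is marked, since C sits at distance 2, which is not smaller than B.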
Next, we introduce the Pathway-enhanced Attention Mechanism (PAM) and the corresponding PathFormer encoder layer. Figure 2-a-a2,a3 illustrate their architectures. Compared to a standard attention mechanism, PAM concatenates the vector p(v) generated by SNPMF with the gene features, and then utilizes MLPs to compute the keys and queries of genes. Thus, the attention weights in PAM are enhanced by incorporating pathway information from the signaling network. On the other hand, a linear projection layer is used to learn the values (of genes) based on the initial gene features. Analogous to a standard Transformer encoder layer, our PathFormer encoder layer consists of a PAM and a subsequent feed-forward network (FFN), which consists of a standard Dropout layer → FC (fully connected) layer → Activation layer → Dropout layer → FC layer, with a residual connection from before the first FC layer to after the dropout immediately following the second FC layer.
In the end, we present the overall mathematical formulation of the proposed PathFormer encoder layer.
A gene subset S_n of K genes in the input gene network is extracted by the KD-Sortpool layer. Let T_l ∈ R^{K×d_l} be the matrix of gene feature vectors input to the l-th PathFormer encoder layer, where d_l is the dimension of gene features in that layer. Let p_l = [p(v_1), p(v_2), ..., p(v_K)]^T ∈ R^{K×B|S|} be the matrix of pathway vectors generated by SNPMF for the genes in subset S_n. The key matrix K_l and query matrix Q_l in PAM are computed from the concatenation of T_l and p_l, and the value matrix V_l from a linear projection of T_l. The attention matrix Att_l is then computed as a softmax of the dot product of the key matrix K_l and query matrix Q_l, and the attention weights in Att_l are used to calculate a weighted sum of the corresponding values in V_l, capturing the relevant information from the genes according to the importance indicated by the attention weights.
After that, the output of the current PathFormer encoder layer l is computed through the feed-forward network.
The output gene feature matrix O l is used as the input to the next PathFormer encoder layer l + 1.
That is, Z_{l+1} = O_l. In the first PathFormer encoder layer, the input feature vector of a gene v is the concatenation of its expression values and the encoding p(v), where p(v) works as the positional encoding in a standard Transformer model, identifying the gene's position based on its order in the set S.
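As an illustration, the pathway-enhanced attention described above can be sketched in a few lines of NumPy; the plain projection matrices stand in for the MLPs and linear layer of the actual model, and all names here are our own simplifications:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pam_attention(T, P, Wk, Wq, Wv):
    """Sketch of the Pathway-enhanced Attention Mechanism (PAM).

    T:  (K, d)       gene feature matrix for the selected gene subset
    P:  (K, B*|S|)   pathway vectors from SNPMF
    Wk, Wq: weights applied to [T; P] (stand-ins for the paper's MLPs)
    Wv: value projection applied to T only (the paper's linear layer)
    """
    TP = np.concatenate([T, P], axis=1)              # inject pathway info
    K_mat, Q_mat, V_mat = TP @ Wk, TP @ Wq, T @ Wv
    att = softmax(Q_mat @ K_mat.T / np.sqrt(K_mat.shape[1]))  # (K, K)
    return att @ V_mat                               # weighted sum of values
```

The only difference from standard self-attention is that keys and queries see the pathway encoding P, so two genes with identical expression but different pathway contexts receive different attention weights.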

Readout Mechanism
The last PathFormer encoder layer (i.e., layer L) outputs a matrix O_L (or Z_{L+1}) that contains the learnt embeddings of genes (i.e., gene feature vectors) in the gene subset S_n generated by KD-Sortpool. We seek a readout mechanism to generate a vector z from O_L as the representation of the input gene signaling network. To avoid information loss, the readout mechanism needs to encode the order of genes in the universal gene set S and include all genes in the gene subset S_n. Thus, we use z = Σ_{k=1}^{K} W_{h(v_k)} o_k, where o_k is the embedding of the k-th selected gene in O_L and W_{h(v_k)} is the trainable weight matrix associated with the gene whose order in set S is h(v_k). Finally, z is fed to an MLP to obtain the final prediction ŷ, which provides the estimated probability vector of classification through a softmax operation.

Loss Function
Cross Entropy (CE) Loss. This paper studies Alzheimer's disease (AD) classification and cancer subtype classification, so the classification loss is the cross-entropy loss, L_CE = -(1/N) Σ_{n=1}^{N} Σ_{c=1}^{C} y_{n,c} log(ŷ_{n,c}), where N is the number of samples/patients; C is the number of classes in the problem; y_{n,c} is the ground-truth label of patient/sample n such that y_{n,c} = 1 if the sample is in class c; and ŷ_{n,c} is the c-th entry of softmax(ŷ) for patient/sample n.
Gene Subset Consistency (GSC) loss. KD-Sortpool introduces an approach to estimate the association strength A(S_n) between the gene subset S_n and a particular disease based on the trainable distribution ε(v) and the gene-disease association value e(v). To force the gene selection process to be disease-specific and consistent with the prior biological knowledge associated with the disease of interest, we propose the gene subset consistency (GSC) loss based on the formulation of A(S_n).
Since e(v) takes the GDA score of a gene, which ranges from 0 to 1, A(S_n) is upper bounded by Σ_{v∈S_n} ε(v); our GSC loss is built from this bound, so a smaller GSC loss indicates a stronger association between the selected gene subset S_n and the disease of interest. Lastly, the overall loss combines the two terms as L = L_CE + σ · L_GSC, where σ is a tunable hyper-parameter, and the objective is to minimize this loss. When samples come from diseases with significantly different prior biological knowledge, L_GSC serves as a regularization term that penalizes KD-Sortpool for selecting very similar gene subsets for these completely different diseases.
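A compact sketch of the combined objective; the cross-entropy term is standard, while the GSC term is our assumption (the gap between the upper bound Σ ε(v) and A(S_n)), since the exact formula is not reproduced in this text:

```python
import numpy as np

def combined_loss(y_true, y_prob, eps_sel, gda_sel, sigma=0.1):
    """Sketch of the PathFormer objective (our reconstruction).

    y_true:  (N, C) one-hot ground-truth labels
    y_prob:  (N, C) softmax outputs of the model
    eps_sel: selection probabilities eps(v) of the K selected genes
    gda_sel: GDA scores e(v) in [0, 1] of the same genes
    sigma:   tunable weight of the GSC regularizer
    """
    # standard cross-entropy over samples and classes
    ce = -np.mean(np.sum(y_true * np.log(y_prob + 1e-12), axis=1))
    # ASSUMED GSC form: gap between the bound sum(eps) and
    # A(S_n) = sum(eps * e); small when selected genes have high GDA scores
    gsc = np.sum(eps_sel * (1.0 - gda_sel))
    return ce + sigma * gsc
```

With a perfect prediction and selected genes whose GDA scores are all 1, both terms vanish, matching the intuition that a smaller GSC loss means a stronger subset-disease association.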

Interpretation from PathFormer
Target identification and target-target co-effect estimation. The genes remaining in the last graph layer are the identified targets. The number of remaining genes, e.g., 50 or 100 targets, is a model parameter set by users. Analogous to the self-attention mechanism, the Pathway-enhanced Attention Mechanism (PAM) in the PathFormer encoder layer provides an in-hoc approach to interpret the co-effects of genes. The attention matrices {Att_{l,n} | l = 1, 2, ..., L; n = 1, 2, ..., N} enable us to compute the population-based connection strength between any gene pair (i, j) as the average of the corresponding attention weights over all layers and samples, i.e., c(i, j) = (1/(N·L)) Σ_{n=1}^{N} Σ_{l=1}^{L} (Att_{l,n})_{i,j}, where N is the number of patients/samples and L is the number of PathFormer encoder layers.

Experiment Setup
We use NVIDIA GTX 1080 Ti GPUs to train/test our PathFormer model and the other deep learning baselines. The Python environment and model architectures are built on the PyTorch and PyTorch Geometric libraries. To provide robust evaluation, we perform 5-fold cross-validation to test the predictive performance of each model and report the average prediction accuracy as well as the standard deviation across folds.
In the experiments, our PathFormer model is implemented with one KD-Sortpool layer and two subsequent PathFormer encoder layers. In the KD-Sortpool layer, we test different values of K (the number of genes to select) from the set {100, 500, 1000} in Section 3.5 to validate its ability to detect gene subsets of different sizes. In each PathFormer encoder layer, the dimension of gene features d_l is set to 32; the GNNs that compute the query matrix and key matrix take two GIN Xu et al. [2019] graph convolution layers, where the feature dimensions of both the hidden layer and output layer are set to 32. Other MLPs in PathFormer take 2 layers, where the feature dimension of the hidden layer is set to 64. When optimizing the parameters of the PathFormer model and the other deep learning baselines, we use the Adam optimizer with an initial learning rate of 0.001, annealed to half every 30 epochs; training is stopped when the validation metric does not improve for 5 epochs (early stopping with patience 5).
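The optimization schedule above (halving the learning rate every 30 epochs, early stopping with patience 5) can be sketched framework-independently; this toy helper is our own illustration, not code from the paper:

```python
def train_schedule(val_metrics, lr0=0.001, anneal_every=30, patience=5):
    """Simulate the schedule described above: the learning rate halves
    every `anneal_every` epochs, and training stops once the validation
    metric fails to improve for `patience` consecutive epochs.

    val_metrics: sequence of per-epoch validation metrics (higher = better)
    Returns (stop_epoch, final_lr).
    """
    best = float("-inf")
    since_best = 0
    lr = lr0
    epoch = 0
    for epoch, metric in enumerate(val_metrics, start=1):
        if epoch % anneal_every == 0:
            lr /= 2.0                    # anneal to half every 30 epochs
        if metric > best:
            best, since_best = metric, 0
        else:
            since_best += 1
            if since_best >= patience:   # early stopping
                break
    return epoch, lr
```

For example, a run whose validation metric plateaus at epoch 3 stops at epoch 8 (3 + patience of 5), before any learning-rate annealing occurs.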

Highly accurate prediction capability
We compare our PathFormer model with existing state-of-the-art deep learning (DL) models and popular DL models for gene expression analysis to evaluate classification accuracy on different bioinformatics tasks. Notably, previous DL models only achieve a prediction accuracy slightly better than random guessing (accuracy = 0.5) in AD classification, which limits their real-world applicability. Our PathFormer improves the prediction accuracy in AD classification to around 0.8, a desirable level for applications.

Highly reproducible biomarker detection
To evaluate biomarker detection across different datasets, we tested three values of the hyper-parameter K (the number of genes to select) in KD-Sortpool, i.e., K = 100, K = 500, K = 1000, and optionally K = the number of all genes. Figure 4-a visualizes the detected gene subsets from the two AD datasets (Mayo & RosMap) and the cancer dataset. We find that the detected gene subset expands as K increases: if a gene is in the gene subset when K = 100, the same gene also appears in the detected gene subset when K = 500 or K = 1000. This property is desirable because we will not get contradictory results when using different K; if a gene appeared in the detected subset for K = 100 but not for K = 500, it would confuse researchers who use the model to search for gene subsets of different sizes. Figure 4-b shows the F1 score and classification accuracy of the PathFormer model on the AD classification datasets. Though we observe a decrease in classification accuracy on Rosmap when K increases from 100 to 500, prediction results improve in the other settings, as keeping more genes in KD-Sortpool helps reduce information loss. Figure 4-a,c also compare the patterns of detected gene subsets for different diseases. As shown in Figure 4-a, no matter which K is used, the patterns of detected gene subsets from the AD datasets (Mayo & RosMap) are very similar, yet they differ from the pattern of detected gene subsets from the cancer dataset. To quantify this observation, Figure 4-c computes the overlap size of detected gene subsets for the same/different diseases, and we find that the overlap is significantly larger when the detected gene subsets are related to the same disease/phenotype.
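The nesting behavior is a direct consequence of ranking genes by a single, K-independent importance score, as KD-Sortpool does; a toy sketch (gene names and scores here are illustrative only):

```python
def top_k_genes(scores, k):
    """Rank genes by a fixed importance score and keep the top k,
    mimicking KD-Sortpool's sort-and-pool selection."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

# Illustrative (made-up) importance scores for five genes.
scores = {"APP": 0.9, "APOE": 0.8, "PSEN1": 0.6, "TP53": 0.4, "EGFR": 0.2}

# Because the ranking is fixed, smaller selections nest inside larger ones.
assert top_k_genes(scores, 2) <= top_k_genes(scores, 4)
```

A selection rule that re-ranked genes per K (e.g., based on per-sample embeddings) would not enjoy this guarantee, which is why the score-based design matters for reproducibility.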
The effectiveness of deep learning models in analyzing graphs is usually affected by graph properties. For instance, Morris et al. [2019] and Xu et al. [2019] show that message passing GNNs cannot be more powerful than the 1-dimensional Weisfeiler-Lehman (1-WL) algorithm Leman and Weisfeiler [1968] in distinguishing non-isomorphic graphs; thus these GNNs ignore cyclical information and cannot provide desirable prediction results on social networks, where cycles are critical features. In this section, we discuss some properties of gene networks that explain the subpar performance of existing deep learning models. We find that the average node degree in graphs from popular graph datasets is usually smaller than 10, while that in gene networks is usually larger than 25. On the other hand, gene networks are large-scale graphs that usually contain more than 3000 genes, while popular graph benchmarks usually consider small-scale graphs.
Two severe consequences arise from the above properties. (1) First, the extremely large node degrees lead to the over-squashing problem Alon and Yahav [2020]: the receptive field of nodes grows exponentially with the number of GNN layers. Since the size of the receptive field reflects how much information must be encoded, dominant GNNs are susceptible to a bottleneck in which exponentially growing information is squeezed into fixed-size vectors.
(2) Second, as popular GNNs are limited in their expressivity to encode all graph information, a large body of work has been proposed to design more powerful GNNs. Basically, these works can be categorized into two groups, subgraph-based GNNs Zhang and Li [2021] and high-order GNNs Grohe [2021], and their complexity is at least O(n²), where n is the graph size. Hence, these advanced deep learning models, designed to enhance prediction performance, have space/time complexity issues when applied to large-scale graphs like gene networks. It has been proven that the power of popular GNNs comes from their ability to filter out the high-frequency components of node features: popular GNNs essentially act as a low-pass filter on graphs. However, when we perform the same experiment on gene network datasets, the orange curves on the left of Figure 3-b indicate that gene networks do not have this low-pass nature; both the low-frequency and high-frequency components are important. Thus, popular GNNs that filter out the high-frequency components of node features are not suitable for gene networks.
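To make the low-pass claim concrete, here is a toy NumPy example (our own illustration, not from the paper) showing that one row-normalized propagation step, of the kind used by GCN-style layers, preserves a smooth signal but attenuates a sign-alternating, high-frequency one:

```python
import numpy as np

# One propagation step X' = D^{-1} (A + I) X averages each node with its
# neighbors, damping the high-frequency component of a signal on a path
# graph while leaving the smooth (constant) component untouched.
A = np.eye(6, k=1) + np.eye(6, k=-1)           # path graph on 6 nodes
A_hat = A + np.eye(6)                           # add self-loops
P = A_hat / A_hat.sum(axis=1, keepdims=True)    # row-normalized propagation

smooth = np.ones(6)                             # low-frequency signal
rough = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])  # high-frequency signal

assert np.allclose(P @ smooth, smooth)          # constant signal preserved
assert np.linalg.norm(P @ rough) < np.linalg.norm(rough)  # rough damped
```

If the prediction target depends on the high-frequency component, as the orange curves in Figure 3-b suggest for gene networks, repeated applications of such a filter progressively destroy the relevant signal.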

Theoretical Results
In this section, we provide a mathematical formulation of the graph machine learning problem, which illuminates the theoretical path to designing a GNN architecture that can address both the over-squashing problem and the absence of the low-pass nature.
Let G = {G_n = (V_n, E_n) | n = 1, ..., N} be a set of graphs, where V_n and E_n contain the node and edge information of graph G_n. For a graph G_n, each node/gene i ∈ V_n has a d-dimensional feature x_i, while the graph G_n has a label y_n to predict. We use A to denote the adjacency matrix and D to denote the diagonal degree matrix such that D_ii = Σ_j A_{i,j}. Then we set D̃ = D + I.
Definition 4.1 (Optimization formulation). Let X' be the output of a GNN model/layer f such that X' = f(X, A), where X ∈ R^{n×d} is the input feature matrix. Let N(i) = {j ∈ V_n | (i, j) ∈ E_n} be the set of neighbors of node/gene i. Then the unconstrained optimization problem is formulated as follows, where the D̃-norm is induced by the D̃-inner product, i.e., ||x||_D̃² = (x, x)_D̃. The first term in the objective constrains the output of the GNN to not be too far from the input, while the second term indicates that the formulation is essentially a type of Laplacian smoothing over the whole graph, where p_{i,j} characterizes the similarity between a node/gene pair (i, j).
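The objective itself is elided in this text; a Laplacian-smoothing formulation consistent with the surrounding description (our reconstruction, not necessarily the authors' exact equation) is:

```latex
\min_{X'} \;\;
\underbrace{\|X' - X\|_{\tilde{D}}^{2}}_{\text{fidelity to the input}}
\;+\;
\underbrace{\sum_{i \in V_n} \sum_{j \in \mathcal{N}(i)} p_{i,j}\,
            \|x'_i - x'_j\|_{2}^{2}}_{\text{Laplacian smoothing}}
```

The first term keeps the output near the input features; the second penalizes feature differences across edges, weighted by the pairwise similarity p_{i,j}.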
Theorem 4.2. The optimal solution X* to the optimization formulation resolves the challenge posed by the absence of the low-pass nature. Let M denote the mask matrix such that M_{i,j} = 1 if j ∈ N(i) and M_{i,j} = 0 otherwise. Then M P X is the first-order approximation of X*.
We prove Theorem 4.2 in Appendix D. Basically, this theorem provides insights for designing a GNN that generates a first-order approximation to the optimal solution, and the key problem is how to generate the trainable parameter matrix P. The straightforward solution is to use the global (self-)attention mechanism such that P_{i,j} = g(x_i, x_j) / Σ_j g(x_i, x_j), as it has been shown that the attention mechanism also helps solve the over-squashing problem Alon and Yahav [2020]. Thus, Theorem 4.2 can be used to design a GNN layer that addresses both the over-squashing problem and the lack of the low-pass nature. In the proposed PathFormer encoder layer, M P is estimated through Equations (3)-(7), which brings a significant improvement of prediction results.

Conclusion
In this paper, we introduce an interpretable graph Transformer, PathFormer. PathFormer significantly outperforms strong baselines, including dominant GNNs, recent graph Transformers, and popular gene-network-specific deep learning models. In conclusion, PathFormer achieves highly accurate disease diagnosis and identifies a stable set of biomarkers in omics data-driven studies.

A Transformer and Graph Transformer
A.1 Transformer on Graphs
Transformer The Transformer Vaswani et al. [2017] solves the language modeling problem [Dai et al., 2019, Al-Rfou et al., 2019, Devlin et al., 2019, Lewis et al., 2020] using the self-attention mechanism, and improves over RNN-based or convolution-based deep learning models in both accuracy and efficiency. The Transformer encoder consists of a stack of Transformer encoder layers, where each layer is composed of two sub-networks: a (multi-head) self-attention network and a feed-forward network (FFN). Let H = (h_1^T, h_2^T, ..., h_n^T) be the input to a Transformer encoder layer. In the self-attention network, the attention mechanism takes H as input and applies different linear projections to obtain the query matrix Q, key matrix K, and value matrix V. The attention matrix A = softmax(QK^T / √d_k) is then computed to measure the similarities, and is used to update the representations in parallel, Z = AV.
After the self-attention network, the feed-forward network applies two linear transformations with a Rectified Linear Unit (ReLU) activation in between to generate the output, i.e., O = FFN(Z).
The FFN is composed of a standard Dropout Layer → Layer Norm → FC (fully connected) Layer → Activation Layer → Dropout Layer → FC Layer → LayerNorm sequence, with residual connections from Z to after the first dropout, and from before the first FC layer to after the dropout immediately following the second FC layer.
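The two sub-networks can be sketched in a few lines of plain Python (toy single-head attention with illustrative identity projection matrices and a bare two-layer FFN; the real model uses learned weights and the full dropout/LayerNorm/residual sequence described above):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def self_attention(H, Wq, Wk, Wv, dk):
    """Single-head attention: A = softmax(Q K^T / sqrt(d_k)), Z = A V."""
    Q, K, V = matmul(H, Wq), matmul(H, Wk), matmul(H, Wv)
    scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(dk)
               for j in range(len(H))] for i in range(len(H))]
    return matmul([softmax(row) for row in scores], V)

def ffn(Z, W1, W2):
    """Two linear transformations with a ReLU in between: O = FFN(Z)."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(Z, W1)]
    return matmul(hidden, W2)

H = [[1.0, 0.0], [0.0, 1.0]]   # two tokens/genes with one-hot features
I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity projections, for illustration only
Z = self_attention(H, I2, I2, I2, 2)
O = ffn(Z, I2, I2)
```

With one-hot inputs, each row of Z is a convex combination of the input rows, and each token attends most strongly to itself.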
Transformer on graphs Recently, the Transformer architecture has been applied to graph learning tasks to avoid the structural inductive bias of GNNs Dwivedi and Bresson [2020], Kreuzer et al. [2021], Mialon et al. [2021]. To incorporate the topology information of graphs, various positional encoding schemes have been proposed to encode structural relations Dong et al. [2022] or positional information Ying et al. [2021] about nodes. On the other hand, other works Wu et al. [2021] are designed to embed structural similarities between nodes by learning input node features with GNNs. These graph Transformers have achieved great success and sit atop leaderboards such as the Open Graph Benchmark Hu et al. [2020].

B Other Backgrounds
Interpretable GNN: An interpretable GNN aims to present a transparent and understandable prediction process to humans. In other words, which parts of the input have a significant impact on the prediction? For a gene network, these could be genes, relationships between correlated genes, or a combination of both, i.e., motifs Xing et al. [2022]. Most interpretable GNNs, such as GNNExplainer, take a local interpretable mechanism to explain the key subgraphs of each graph. One assumption behind this type of interpretation is that some input components contribute significantly to the prediction, while the insignificant components have less impact. Such an assumption can mean that these interpretable GNNs do a poor job of discovering the important subgraphs when the genetic features or the relationships among genes cannot be clearly distinguished. Furthermore, local interpretability treats GNNs as black boxes Kovalerchuk et al. [2021], thus limiting human trust in the given interpretation.
Low-pass Nature of GNN: In general graph learning problems like semi-supervised node classification, node features x(i) are often regarded as signals on nodes, and techniques from graph signal processing Ortega et al. [2018] are leveraged to understand the signal characteristics. Various prior works Hoang et al. [2021], Zhu et al. [2021], Pan et al. [2020] assume or observe that node features x(i) consist of low-frequency true features and noise. Based on this assumption, numerous GNNs are designed to suppress the high-frequency components in node features, thus essentially acting as a low-pass filter on graph signals. However, this assumption has not been verified on gene networks.
Figure 1 shows that gene networks do not benefit from omitting high-frequency components in signals. This means that the low-pass nature might not hold in the studied problem, and keeping only low-frequency signals might degrade the performance of GNNs due to information loss.
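To make the low-pass idea concrete, the following toy sketch performs graph Fourier low-pass filtering on the smallest possible graph (two nodes, one edge), where the Laplacian eigenvectors are known in closed form. On popular datasets such filtering removes noise; the point of Figure 1 is that on gene networks the discarded high-frequency part still carries signal.

```python
import math

def low_pass(signal):
    """Graph Fourier low-pass filter on a 2-node, 1-edge graph whose
    Laplacian L = [[1, -1], [-1, 1]] has eigenvectors
    u0 = [1, 1]/sqrt(2) (frequency 0) and u1 = [1, -1]/sqrt(2) (frequency 2).
    Only the zero-frequency coefficient is kept."""
    c0 = (signal[0] + signal[1]) / math.sqrt(2)    # projection <x, u0>
    return [c0 / math.sqrt(2), c0 / math.sqrt(2)]  # reconstruct from u0 alone

x = [3.0, 1.0]      # smooth part [2, 2] plus high-frequency part [1, -1]
print(low_pass(x))  # ~[2.0, 2.0]: the high-frequency component is discarded
```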

C Details of deep learning baselines
In GNN baselines, all graph convolution layers have a feature dimension of 128; the number of graph convolution layers is selected from the set {2, 3, 4}; the graph-level readout function is selected from the function set {mean, sum, average}. In graph pooling models, SAGPool and DiffPool take two pooling layers, where the first layer keeps 500 genes and the second layer keeps 100 genes. SortPool contains one pooling layer, which sorts genes according to the learnt embeddings and then keeps the top 100 gene embeddings as the input to a CNN model. Other parameter settings in graph pooling models follow their original papers. In graph Transformers, the number of encoder layers is 3, the dimension d_k is set to 16, and the number of heads is set to 4. Due to the nature of the bioinformatics datasets, Graphormer does not perform pre-training. In GraphTrans, we use a GIN with 2 layers to extract the node embeddings.
D Details of Definition 4.1 and Proof of Theorem 4.2.
Given a graph G = (V, E) with |V| = n, let J ∈ R^{n×n} be the all-ones matrix, and let P ∈ R^{n×n} be the parameter matrix whose element (i, j) is p_{i,j}. In addition, we use • to denote the Hadamard product of two matrices. Then the objective function f in Definition 4.1 can be formulated in matrix form, where S and L are diagonal matrices such that S_i,i = Σ_j (M • P)_{i,j}. The first-order approximation of the optimal solution (through the Neumann series of a matrix) is then D̃^{-1}(M • P)X, with an additional constant term when the constraints are not normalized as in Definition 4.1.
Since D̃^{-1} can be viewed as a normalization term, the solution can be simplified to (M • P)X, where P ∈ R^{n×n} is a trainable parameter matrix.
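The first-order step above relies on truncating a Neumann series, (I − T)^{-1} = I + T + T² + ..., after the linear term. A small sanity check (illustrative only; the function name is ours):

```python
def neumann_first_order(T):
    """First-order truncation of (I - T)^{-1} = I + T + T^2 + ... -> I + T."""
    n = len(T)
    return [[(1.0 if i == j else 0.0) + T[i][j] for j in range(n)] for i in range(n)]

# scalar sanity check: 1 / (1 - 0.1) = 1.111... vs first-order 1.1
exact = 1.0 / (1.0 - 0.1)
approx = neumann_first_order([[0.1]])[0][0]
print(exact - approx)  # error is O(t^2), about 0.011
```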

E Additional Interpretation Results
The interpretability of deep learning models is of vital importance for real-world applications in bioinformatics, as interpretability can illustrate the chain of reasoning to aid trust and to increase testability. In the aforementioned gene-network-based bioinformatics tasks, the prevailing interpretation methods aim to extract sub-networks consisting of core gene pathways Preuer et al. [2018] that reveal the underlying functional mechanism.
E.1 Gene set enrichment analysis.
Gene set enrichment analysis is ubiquitous in genetic research to reveal the association between a set of genes and diseases/phenotypes of interest. We conducted GO & KEGG analyses to perform gene set enrichment analysis on the detected gene subset for AD, and the results are reported in panels c and d of Figure 5.
Figure 5-c illustrates the enrichment analysis results from KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways; the top 20 significant pathways (those with the smallest adjusted p-values) are presented. We find that the detected biological pathway of the greatest significance is Alzheimer's disease (Homo sapiens, hsa05010), which belongs to the class of neurodegenerative-disease pathways and is associated with Alzheimer's disease.
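Enrichment analyses of this kind typically score a pathway with a hypergeometric over-representation test. A self-contained sketch with toy numbers (the gene counts below are illustrative, not the actual KEGG statistics):

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k): probability that a random n-gene subset drawn from a
    universe of N genes hits at least k of the K pathway genes."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# toy numbers: 3000-gene universe, 100-gene pathway, 100 selected genes,
# 20 of which land in the pathway (expected overlap by chance is ~3.3)
p = hypergeom_enrichment_p(3000, 100, 100, 20)
print(p < 0.05)  # True: the pathway is strongly over-represented
```

In practice the resulting p-values are adjusted for multiple testing before pathways are ranked, as in Figure 5-c.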
• BP (Biological Process) results: Many reported biological processes are associated with AD, including negative regulation of amyloid-beta formation, aging, astrocyte activation, positive regulation of neuron death, cellular response to amyloid-beta, cellular response to copper ion, and learning or memory. More specifically, amyloid-beta (Aβ) pathology constitutes an essential mechanism of AD; older age is one of the most important risk factors for AD; astrocytes become activated around the senile plaques in the AD brain; neuronal cell death, such as autophagy-dependent cell death, is known to be related to AD; AD can cause a faulty blood-brain barrier that prevents the clearing away of toxic beta-amyloid; deregulated copper ions may ultimately contribute to the synaptic failure, neuronal death, and cognitive decline observed in AD patients; and the inability to create new memories is often among the very first signs of AD.
• CC (Cellular Component) results: Many reported cellular components have been proven to be associated with AD, including extracellular exosome, neurofibrillary tangle, synapse, membrane raft, recycling endosome, and dendrite. Among them, neurofibrillary tangle covers about 60% of the genes in the detected gene subset, while the others each cover only a very small proportion of genes in the subset (i.e., significantly less than 10%). To explain briefly, neurofibrillary tangles are abnormal accumulations of the tau protein, and the faulty blood-brain barrier caused by AD prevents the clearing of tau protein.
• MF (Molecular Function) results: Among the reported molecular functions, both amyloid-beta binding and tau-protein binding are associated with AD, and each covers more than 10% of the genes in the detected gene set. We explained the association of tau protein with AD in the CC results, and the association of amyloid-beta with AD in the BP results.
The attention mechanism in the PathFormer encoder layers provides a built-in approach to interpret the relation/attention strength between genes in the detected gene subset using function (14). Figure 5-b visualizes the normalized attention strength matrix of every gene pair. We find that some target genes receive much more attention from source genes. In the heatmap, we can identify these genes by searching along the x-axis for target genes whose attention vector (a line in the heatmap) contains many large values. CLU, TPP1, PSEN1, PICALM, and APP are some examples. These genes usually have a relatively large GDA score, which quantifies the association between a gene and AD. Here, GDA(APP) = 0.9, GDA(PSEN1) = 0.7, GDA(CLU) = 0.5, and GDA(TPP1) = 0.3. Hence, these genes show a tendency to interact with other genes in the prediction process.

F Comparison of gene expressions of AD and control in different datasets
This experiment performs an independent-samples t-test to compare the gene expressions of AD samples and control samples in two datasets: MAYO and ROSMAP. The objective is to show why our proposed Knowledge-guided Disease-specific Sortpool (KD-Sortpool) is designed to be independent of gene expressions. In bioinformatics, disease-specific biomarker/gene ranking is based on a group of samples (AD samples vs. control samples); the ranking function then tends to select genes whose expression values are significantly lower/higher in the AD group than in the control group. Consequently, we evaluate whether the difference in mean gene expression values between the AD group and the control group is consistent across datasets of the same disease.
Figure 6 shows the experimental results. We find that the t-values show different patterns in the MAYO and ROSMAP datasets. For many genes, the t-value is significantly larger than 0 on MAYO while significantly smaller than 0 on ROSMAP, and vice versa. This observation indicates that a disease-specific group-based ranking function would detect different gene patterns in MAYO and ROSMAP, which can lead to contradictory conclusions, since both MAYO and ROSMAP are used to study Alzheimer's disease.
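The per-gene statistic behind Figure 6 is a two-sample t statistic; a minimal Welch-style sketch with made-up expression values (not actual MAYO/ROSMAP data):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for one gene's expression in two groups
    (unequal variances, as in the AD-vs-control comparison)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

ad      = [5.1, 4.8, 5.5, 5.0]  # made-up TPM-like values, AD group
control = [3.9, 4.1, 4.0, 4.2]  # made-up values, control group
print(welch_t(ad, control) > 0)  # True: higher mean expression in the AD group
```

A positive t-value on one dataset and a negative one on the other, as observed for many genes in Figure 6, is exactly the inconsistency that motivates an expression-independent selection layer.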

Figure 2 -
Figure 2-b illustrates the overall architecture. Specifically, PathFormer consists of a Knowledge-guided Disease-specific Sortpool layer (KD-Sortpool layer) and several PathFormer encoder layers. Based on prior knowledge of a particular disease, the KD-Sortpool layer selects the top K genes as the gene subset for the purpose of outcome interpretation. Then the PathFormer encoder layers iteratively update the features of each gene by aggregating its neighbors' gene features. In the end, PathFormer summarizes all gene features through an MLP (multilayer perceptron) to generate a vector embedding of the input gene network, and the vector embedding is then used to predict the disease or a particular phenotype.
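The gene-selection step can be sketched as a simple top-K ranking over per-gene importance scores. In this illustrative sketch the scores reuse the GDA values reported in Appendix E, and GENE_X is a hypothetical filler gene; the actual KD-Sortpool learns its ranking from prior knowledge and data rather than from fixed scores.

```python
def kd_sortpool(gene_scores, K):
    """Rank genes by an importance score and keep the top K."""
    return sorted(gene_scores, key=gene_scores.get, reverse=True)[:K]

# GDA scores from Appendix E; GENE_X is a hypothetical low-scoring filler
scores = {"APP": 0.9, "PSEN1": 0.7, "CLU": 0.5, "TPP1": 0.3, "GENE_X": 0.1}
print(kd_sortpool(scores, 3))  # -> ['APP', 'PSEN1', 'CLU']
```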

Figure 2 :
Figure 2: Architecture overview. a introduces the proposed PathFormer encoder layer. The PathFormer encoder layer consists of a Pathway-enhanced Attention Mechanism (PAM) and a subsequent feed-forward network (FFN). Compared to a standard attention mechanism, PAM utilizes the proposed SNPMF (Signaling Network Pathway Modeling Framework) to generate vector embeddings of the pathways around each gene, which are then concatenated with gene features to compute the key matrix and query matrix. b illustrates the overall architecture of the PathFormer model. The PathFormer model is composed of a knowledge-guided disease-specific Sortpool (KD-Sortpool) layer and a stack of PathFormer encoder layers. It takes gene networks of patients as input and outputs predictions of disease/phenotype as well as a gene subset for biological interpretation.

Figure 3 :
Figure 3: a Gene networks have significantly larger graph size and cardinality than popular graphs, which causes a severe over-smoothing problem in graph machine learning. b Popular graphs are usually treated as signals consisting of a low-frequency true feature and high-frequency noise. Hence, the low-pass nature indicates that graph neural networks can be designed to filter out the high-frequency component to achieve good performance. However, gene networks do not have the low-pass property. c PathFormer addresses the problems in a and b, thus significantly improving the prediction results over existing state-of-the-art (SOTA) deep learning models.

Figure 3
Figure 3-b: absence of the low-pass nature. In graph machine learning, node features are often regarded as signals on nodes Ortega et al. [2018]. Various prior works Hoang et al. [2021], Zhu et al. [2021], Pan et al. [2020] have observed that node features of graphs in popular graph datasets consist of low-frequency true features and high-frequency noise. This property is called the low-pass nature, and Figure 3-b illustrates it. The figure reports the average performance of 2-layer MLPs on frequency-limited feature vectors, where node features are first transformed to the graph Fourier space and the feature vectors used for prediction are then reconstructed from the top frequency components. The green curves on the right of Figure 3-b imply that the true features (i.e., the low-frequency components of node features) carry sufficient information for graph machine learning on popular graph datasets.

Figure 5 d
Figure 5-d presents the enrichment analysis results from GO (Gene Ontology) Thomas et al. [2022]. The p-value that characterizes the significance of each biological process/cellular component/molecular function is visualized as color, and we only report GO terms with a p-value smaller than 0.05.

Figure 5 :
Figure 5: a The KD-Sortpool layer in PathFormer selects 100 core genes as a gene subset to explain Alzheimer's disease. b PathFormer can compute the relation strength (attention) between selected genes. Some genes (e.g., TPP1, PSEN1, CLU, APP) receive significantly larger attention from other genes; these genes usually have a large GDA score and are closely associated with Alzheimer's disease. c Enrichment analysis on the detected core gene subset finds significant pathways associated with Alzheimer's disease. d GO term analysis on the detected core gene subset finds significant biological processes associated with Alzheimer's disease.

Figure 6 :
Figure 6: Comparison of t statistics of each gene in dataset MAYO and ROSMAP.
3.1 Datasets and Metrics
Alzheimer's disease datasets (Mayo and ROSMAP): Two datasets, Mayo and ROSMAP, are used as benchmark datasets for the Alzheimer's disease prediction problem in bioinformatics Allen et al. [18]. The objective is to distinguish Alzheimer's disease (AD) samples from normal elderly controls Custodio et al. [2022]. The Mayo dataset is composed of control tissue samples and AD pathological aging samples, while the ROSMAP dataset contains control samples and AD dorsolateral prefrontal cortex samples. The gene features in Mayo and ROSMAP are first mapped to the reference genome using STAR (v.2.7.1a), and then the transcriptomic (TPM) values of 16,132 common protein-coding genes are obtained in both datasets by applying the Salmon quantification tool to alignment-based RNA-seq data. The Mayo dataset contains 158 graphs and the ROSMAP dataset contains 357 graphs, each graph including 3000 genes. The edges between genes are collected from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database Kanehisa and Goto [2000] based on the physical signaling interactions from documented medical experiments. According to the Biological General Repository for Interaction Datasets (BioGRID: https://thebiogrid.org/), any two interrelated genes are undirected.
Cancer datasets (Cancer): To understand differences in biological mechanisms among cancer subtypes, we design the Cancer dataset, which aims to predict the type of cancer samples based on the gene network structure and gene features. Gene features and cancer labels are extracted from the Xena server: https://xenabrowser.net/. The edges between genes are also collected from KEGG. Patient samples are collected from the longevity dataset. This dataset contains 18 different typical cancer types, including uterine carcinosarcoma, thyroid carcinoma, acute myeloid leukemia, skin cutaneous melanoma, thymoma, testicular germ cell tumor, stomach adenocarcinoma, sarcoma, rectum adenocarcinoma, prostate adenocarcinoma, pancreatic adenocarcinoma, ovarian serous cystadenocarcinoma, lung adenocarcinoma, liver hepatocellular carcinoma, mesothelioma, kidney clear cell carcinoma, head & neck squamous cell carcinoma, and uterine corpus endometrioid carcinoma.

Table 1 :
Prediction results of the proposed PathFormer model and deep learning baselines. Best results are highlighted. Four types of deep learning models are used as baselines, visualized by different colors in the table: (1) popular GNNs that achieve leading positions are marked as yellow; (2) dominant graph pooling models for subgraph extraction are marked as green; (3) state-of-the-art graph Transformers are marked as orange; (4) existing powerful deep learning models for analyzing gene networks in other bioinformatics tasks are marked as blue. Table 1 reports the comparison results of our PathFormer model and all baseline DL models using two evaluation metrics: classification accuracy and F1 score. Figure 3-c compares PathFormer and the best existing DL model. In this experiment, the KD-Sortpool in the PathFormer model keeps all genes. The experimental results show that our PathFormer model consistently and significantly improves the prediction results over all baselines. The average improvement of prediction accuracy is at least 38% on AD classification and 23% on cancer classification.