Identifying Cancer Subtypes Using a Residual Graph Convolution Model on a Sample Similarity Network

Cancer subtype classification helps us to understand the pathogenesis of cancer and develop new cancer drugs, treatment from which patients would benefit most. Most previous studies detect cancer subtypes by extracting features from individual samples, ignoring their associations with others. We believe that the interactions of cancer samples can help identify cancer subtypes. This work proposes a cancer subtype classification method based on a residual graph convolutional network and a sample similarity network. First, we constructed a sample similarity network regarding cancer gene co-expression patterns. Then, the gene expression profiles of cancer samples as initial features and the sample similarity network were passed into a two-layer graph convolutional network (GCN) model. We introduced the initial features to the GCN model to avoid over-smoothing during the training process. Finally, the classification of cancer subtypes was obtained through a softmax activation function. Our model was applied to breast invasive carcinoma (BRCA), glioblastoma multiforme (GBM) and lung cancer (LUNG) datasets. The accuracy values of our model reached 82.58%, 85.13% and 79.18% for BRCA, GBM and LUNG, respectively, which outperformed the existing methods. The survival analysis of our results proves the significant clinical features of the cancer subtypes identified by our model. Moreover, we can leverage our model to detect the essential genes enriched in gene ontology (GO) terms and the biological pathways related to a cancer subtype.


Introduction
Cancer is a heterogeneous disease initiated by random somatic mutations and driven by multiple genomic alterations [1][2][3]. Cancer patients are usually divided into subtypes according to their molecular profiles for effective cancer treatment [4][5][6]. Different subtypes of cancer patients have different clinical phenotypes, tumor morphologies, therapeutic schedules and prognoses [7]. For example, breast cancer is divided into four subtypes: Luminal A, Luminal B, Basal and HER2. Each subtype has different forms and different reactions to drugs [8][9][10][11]. Accurate cancer subtyping can help understand the pathogenesis of cancer, boost clinical treatment, improve patients' survival rate and advance research on cancer genomics and precision medicine [12]. In the past decade, some large-scale cancer genomics projects have published genomic, epigenomic, transcriptomic, and proteomic data from thousands of cancer patients [13]. These projects include the cancer genome atlas (TCGA) [14], the International Cancer Genome Consortium (ICGC) [15], and the Pan-Cancer Analysis of Whole Genomes (PCAWG) [16]. These cancer genomic data have extensively promoted the development of computational methods for cancer subtype classification by integrating multi-omic data. Previous studies [17] showed that the gene expression data perform best in detecting the cancer subtypes compared with other omics work, we designed an end-to-end deep learning method, namely ERGCN, which uses a cancer sample similarity network and a Residual Graph Convolutional Network to classify cancer subtypes based on gene expression profiles. Our model utilizes the gene expression profiles as sample features and constructs a sample similarity network regarding their gene co-expression patterns. Then, the sample features and sample similarity network are put into a residual graph convolutional network model to learn the feature representation of samples and to predict subtypes in an end-to-end way. We applied our model to predict subtypes for breast, GBM and lung cancer samples. The results show that our method achieves relatively better performance than the state-of-the-art models in terms of both internal and external evaluation metrics. Moreover, by analyzing the prediction results for breast cancer, we found some key genes that may cause the occurrence of breast cancer subtypes, demonstrating the practicality of ERGCN.

Materials and Methods
Broadly, ERGCN takes two steps to identify cancer subtypes of cancer patients based on their gene expression data and the residual graph convolutional network. Firstly, it calculates the similarity between cancer patients and constructs a network where nodes are cancer patients and edges connect two nodes if their similarities are above a predefined threshold. Next, a residual graph convolutional neural network algorithm is adopted to diffuse the node feature information through the network and to learn the feature representation for every node in an end-to-end pattern. At this step, ERGCN predicts the cancer subtypes of patients according to their feature representations. Figure 1 shows an overview of ERGCN for identifying cancer subtypes. implemented a graph convolution operation on multiple pathway-gene networks to learn gene features, and then leveraged a multi-attention based ensemble model to combine these features in several hundreds of pathways to identify cancer sample subtypes. In this work, we designed an end-to-end deep learning method, namely ERGCN, which uses a cancer sample similarity network and a Residual Graph Convolutional Network to classify cancer subtypes based on gene expression profiles. Our model utilizes the gene expression profiles as sample features and constructs a sample similarity network regarding their gene co-expression patterns. Then, the sample features and sample similarity network are put into a residual graph convolutional network model to learn the feature representation of samples and to predict subtypes in an end-to-end way. We applied our model to predict subtypes for breast, GBM and lung cancer samples. The results show that our method achieves relatively better performance than the state-of-theart models in terms of both internal and external evaluation metrics. Moreover, by analyzing the prediction results for breast cancer, we found some key genes that may cause the occurrence of breast cancer subtypes, demonstrating the practicality of ERGCN.

Materials and Methods
Broadly, ERGCN takes two steps to identify cancer subtypes of cancer patients based on their gene expression data and the residual graph convolutional network. Firstly, it calculates the similarity between cancer patients and constructs a network where nodes are cancer patients and edges connect two nodes if their similarities are above a predefined threshold. Next, a residual graph convolutional neural network algorithm is adopted to diffuse the node feature information through the network and to learn the feature representation for every node in an end-to-end pattern. At this step, ERGCN predicts the cancer subtypes of patients according to their feature representations. Figure  1 shows an overview of ERGCN for identifying cancer subtypes. Input the sample features and the sample adjacency matrix into the residual graph convolution model to obtain category prediction.

Datasets
We tested our model on three different cancer types from TCGA. These three cancer types included breast invasive carcinoma (BRCA), with 102 samples, glioblastoma multiforme (GBM), with 213 samples, and lung cancer (LUNG), with 85 samples. We

Datasets
We tested our model on three different cancer types from TCGA. These three cancer types included breast invasive carcinoma (BRCA), with 102 samples, glioblastoma multiforme (GBM), with 213 samples, and lung cancer (LUNG), with 85 samples. We downloaded the gene expression data and survival information for the three cancer types in supplementary files of Wang [38]. The cancer subtype information was retrieved from TCGA using the R package TCGAbiolinks [39]. By matching the sample ID of the gene expression data, we joined the sample gene expression data with their cancer subtypes. Hence, there were four cancer subtypes for the 102 breast cancer patients (e.g., Basal, LumA, LumB, Her2), four cancer subtypes for the 213 GBM cancer patients (e.g., LGr1, LGr2, LGr3, LGr4) and four cancer subtypes for the 85 lung cancer patients(e.g., Basal, Classical, Secretory, Primitive).

Network Construction
We constructed a patient network based on the similarity of patients. The similarity of patients was calculated by the Pearson correlation coefficient (PCC) [40] of their gene expression profiles (see Equation (1)).
where n is the number of genes, X i and Y i are the expression values of gene i for patients X and Y and X and Y are the average gene expression values of X and Y. The Pearson correlation coefficient (r) measures the degree of linear correlation between two patients, ranging from −1 to 1. When the absolute value of r nears 1, two patients are positively or negatively correlated. Otherwise, the value nears 0. Hence, we consider that an edge connects two patients and thus set the corresponding values in the adjacency matrix to 1 if the absolute value of the Pearson coefficient between the two samples was greater than the threshold (θ). Otherwise, no edges connected two patients, and the corresponding adjacency matrix values were zeros. In this work, θ was set to 0.42 for BRCA, 0.41 for LUNG and 0.8 for GBM.

Residual Graph Convolutional Neural Network
After constructing the patient network, we adopted a graph convolutional network (GCN) to learn the feature representations of the network nodes by considering the network structure and node attributes. A GCN model requires two inputs, an adjacent matrix storing the node connections and a node initial attribute matrix. Let G = (V, E) be a patient network, where V (|V| = n) and E are sets of nodes and edges, respectively. The relationship between each node forms an n × n adjacency matrix A. We assume that every node connects to itself. Hence, the diagonal elements of A are set to 1. D is the degree matrix of A, X is an n × d initial attribute matrix of network nodes, and d is the length of the node's attributes. We regard the patient's gene expression data as the initial node attributes. The GCN model gathers features for a node from itself and local neighbors, then propagates information among nodes and the network. Mathematically, a single GCN layer is defined as follows.
where H (l+1) represents node features learned by the (l + 1)th GCN layer, W ∈ R n * h is a weight parameter matrix of the neural network, h is the hidden layer dimension, H (l) comes from the output feature embedding of the previous GCN layer, H (0) is initialized with X for the first GCN layer, and p is the nonlinear activation function.
To avoid the over-smoothing issue of the GCN model and to emphasize the original node features, we learned from the concept of ResNet [41] and added a layer of residual data to the GCN. To ensure the consistency between the dimension of the input feature and the node feature dimension of a layer GCN, we passed the initial input feature through an independent linear layer and directly connected it to the output of the GCN layer. Mathematically, the skip connection operation can be written as follows.
where H (1) represents the node features learned by the first GCN layer, and linear(X) denotes passing the initial input feature matrix into a linear layer. The ERGCN model consists of two GCN layers. The activation function of the first GCN layer is Relu. We took the gene expression profiles of the samples as initial attributes and input them into the first GCN layer. Before feeding the output of the first GCN layer into the second GCN layer, we added the previous output with the initial node attributes that underwent a linear change to overcome over-smoothing (see Equation (3)). The second GCN layer adopted the softmax activation function, whose outputs are the probabilities of a patient belonging to every cancer subtype. We used the cross-entropy loss functions to quantify the cancer subtype prediction loss.
Given a patient d in the training set T, Y df is a sign function (0 or 1), Y df equals one if the real class label of the patient d is one, otherwise, Y df equals zero, Z df is the predicted probability that cancer patient d is in the category f, which is the output of the second GCN layer. Support existed for a number F of cancer subtypes. The outputting dimension of the second GCN layer was F. We minimized the cancer subtype prediction loss to optimize the parameters in the ERGCN model.
In summary, Algorithm 1 illustrates the pseudocode of ERGCN.

Algorithm 1 ERGCN
Input: Gene expression matrix X ∈ R m×n with m number of samples whose vector length is n, corresponding true labels Y ∈ R m , number of epochs e, learning rate η, dropout rate d. Output: Predicted labels Y. 1.Use Equation (1) to calculate the correlation between samples based on gene expression data to get the correlation matrix A (m * m).
2.Given a threshold θ, set the value of the matrix A greater than θ to 1, and set other values to 0 to obtain the adjacency matrix A (m * m).
3.For i = 1 to epochs do: Calculate the Loss by Equation (4). Update the weights of ERGCN by gradient descent and back propagation. end for

Experimental Parameters
For ERGCN, we set the dimension of the first layer of the graph convolution to 64 and the dimension of the second layer of the graph convolution to the number of categories. We use the Adam optimizer function. The learning rate is set to 0.001. We performed 5-fold cross-validation ten times and compared the average results. The network structure of SAE was the input layer-500-200-50, the pre-training epochs of each layer were 20, and the final fine-tuning epochs were 40. The network structure of VAE was the input layer-512-512-128, with epochs of 300.

Assessing the Performance 2.5.1. External Evaluation Metrics
The external evaluation index evaluates the effectiveness of the algorithm by comparing the predicted classification results with the real ones. We use the adjusted Rand index (ARI) to evaluate the distribution match between the classification results and the known benchmark subtypes. In addition to this, we also used the Accuracy, Precision, Recall, F1 Score, MCC indicators for comparison. The formulas of these indicators are as follows: where TP refers to positive samples predicted by the model as positive, TN refers to negative samples predicted by the model as negative, FP refers to negative samples predicted by the model as positive, and FN refers to positive samples predicted by the model as negative.

Internal Evaluation Metrics
We used two internal evaluation metrics, Silhouette width and the Davies-Boulding Index(DBI), to assess the clustering quality without knowing the classification labels. The Silhouette width was calculated using Formula (11): where a(i) is the average distance between sample i and other samples of the same type, and b(i) is the average distance between sample i and all samples of different types. The value of the Silhouette width is in (−1,1). The larger the Silhouette width value, the higher the similarity of nodes within the same class, and the lower similarity of nodes between classes. The DaviesπBoulding Index formula was defined by Formula (12): where N is the number of clusters, avg(C i ) is the average distance between sample i and its cluster centroid, dis(C i , C j ) represents the distance between the center of class C i and the center of class C j . The lower limit of the DBI is 0, and the smaller the DBI value, the better the clustering.

Results
To verify the effectiveness of ERGCN, we compared its cancer subtype prediction performance with some state-of-the-art methods in terms of internal and external evaluation indicators. The existing methods include stack autoencoder (SAE) [29], variational autoencoder (VAE) [30], Deeptype [35], GCN+PPI, support vector machine (SVM), Random Forest and GcForest [42]. VAE and SAE reduce the dimension of cancer gene expression data and learn the feature representation of cancer patients. Then, cancer patients are classified according to their learned feature representation through the classifiers SVM and GcForest [42]. Deeptype is the latest end-to-end method that combines supervised classification and unsupervised clustering to learn cluster structure representations for cancer-related data and identify cancer subtypes. GCN+PPI conducts a graph convolution operation on the PPI network to learn sample features and predict cancer subtypes [43]. We also investigated the effect of the correlation coefficient threshold on our model and experimented with new sample discrimination to verify the stability of our model. Finally, we performed difference analysis and functional enrichment analysis of cancer subtype-related genes on BRCA to explore potential cancer treatment targets.

Determination of Correlation Coefficient Threshold
The first step of our model was to construct a patient network by calculating the PCC between patients. The parameter θ controls whether or not there is an edge connecting two patients. To test the effect of parameter θ, we compared the predictive performance of our model by setting θ to various values ranging from 0.1 to 0.9. We observed from Figure 2 that the performance of our model rose quickly with the increase of θ at the beginning. After that, it was relatively steady with different θ values. Our model performed best on the BRCA data set when θ was 0.42, where the ACC, MCC, F1 Score, Precision and Silhouette width reached 0.825, 0.743, 0.771, 0.790, and 0.792, respectively. Our model worked best on the GBM data set when θ was 0.8. Its ACC, MCC, F1 Score, Precision and Silhouette width achieved 0.851, 0.801, 0.841, 0.851, 0.763, respectively. On the LUNG data set, the performance of our model was relatively good when θ was 0.41. The ACC, MCC, F1 Score, Precision and Silhouette width were 0.792, 0.716, 0.722, 0.754, and 0.726, respectively. Hence, the parameter θ of our model was set to 0.42 in the BRCA data set, 0.8 in the GBM data set and 0.41 in the LUNG data set when we compared it with other methods. patients are classified according to their learned feature representation through the classifiers SVM and GcForest [42]. Deeptype is the latest end-to-end method that combines supervised classification and unsupervised clustering to learn cluster structure representations for cancer-related data and identify cancer subtypes. GCN+PPI conducts a graph convolution operation on the PPI network to learn sample features and predict cancer subtypes [43]. We also investigated the effect of the correlation coefficient threshold on our model and experimented with new sample discrimination to verify the stability of our model. Finally, we performed difference analysis and functional enrichment analysis of cancer subtype-related genes on BRCA to explore potential cancer treatment targets.

Determination of Correlation Coefficient Threshold
The first step of our model was to construct a patient network by calculating the PCC between patients. The parameter θ controls whether or not there is an edge connecting two patients. To test the effect of parameter θ, we compared the predictive performance of our model by setting θ to various values ranging from 0.1 to 0.9. We observed from

Results of External Evaluation Metrics
We construct a sample similarity network using all samples. Then we divided all samples into five parts. Four parts were selected as the training set, and the remaining part was the test set. We took the average result of ten iterations of the 5-fold crossvalidation test set as the evaluation metric. Tables 1-3 show the performance comparison between our model and other models regarding Accuracy, Recall, F1 Score, ACC, MCC, and ARI on the BRCA, GBM, and LUNG data sets. We observed the significant outperformance of our model compared with other models. On the BRCA data set, our Accuracy rate, F1 Score and MCC values were 2.94%, 14.48% and 5.03% higher than that of Gcforest, which has the best performance among the existing methods. Similar results were observed on the GBM and LUNG data sets. Our model's Accuracy, F1 Score and MCC values ere 1.47%, 1.88% and 5.05% higher on the GBM dataset and 2.35%, 8.13%, 1.88% higher on the LUNG dataset compared to Gcforest, which works best among the existing methods on the two datasets.

Results of External Evaluation Metrics
We construct a sample similarity network using all samples. Then we divided all samples into five parts. Four parts were selected as the training set, and the remaining part was the test set. We took the average result of ten iterations of the 5-fold cross-validation test set as the evaluation metric. Tables 1-3 show the performance comparison between our model and other models regarding Accuracy, Recall, F1 Score, ACC, MCC, and ARI on the BRCA, GBM, and LUNG data sets. We observed the significant outperformance of our model compared with other models. On the BRCA data set, our Accuracy rate, F1 Score and MCC values were 2.94%, 14.48% and 5.03% higher than that of Gcforest, which has the best performance among the existing methods. Similar results were observed on the GBM and LUNG data sets. Our model's Accuracy, F1 Score and MCC values ere 1.47%, 1.88% and 5.05% higher on the GBM dataset and 2.35%, 8.13%, 1.88% higher on the LUNG dataset compared to Gcforest, which works best among the existing methods on the two datasets.

Results of Internal Evaluation Metrics
We repeated the five-fold cross verification ten times. Table 4 reports the average Silhouette width and DBI of our model and other models for the test samples on the three data sets. The smaller the DBI index, the larger the Silhouette width and the more reasonable the classification result. It can be seen from Table 4 that our model keeps excellent performance in terms of compactness and separation. The Silhouette width of ERGCN reached 0.795, 0.763 and 0.727 on the BRCA, GBM and LUNG data sets, respectively, which was 17.24%, 34.32% and 24.49% higher than DeepType, which has the best internal evaluation values among the existing methods. Moreover, ERGCN resulted in 10.24%, 40.24% and 24.65% lower DBI values than DeepType on the BRCA, GBM and LUNG data sets, respectively. We use the t-SNE tool to visualize the initial features of the cancer samples in the test set and on their latent features learned by the ERGCN model. Figure 3 illustrates that the cancer samples can be separated into several subgroups well when using the features learned by our ERGCN model. These results prove the effectiveness of our model on cancer subtype classification.  Figure 3 illustrates that the cancer samples can be separated into several subgroups well when using the features learned by our ERGCN model. These results prove the effectiveness of our model on cancer subtype classification.

Experiments with New Samples
ERGCN classifies cancer subtypes based on the sample similarity network, which calculates Pearson correlations between all pairs of samples ahead of time. To probe the performance of ERGCN on a new sample, which is a popular application in a clinical

Experiments with New Samples
ERGCN classifies cancer subtypes based on the sample similarity network, which calculates Pearson correlations between all pairs of samples ahead of time. To probe the performance of ERGCN on a new sample, which is a popular application in a clinical context of determining a person's cancer subtype, we selected a single sample as the test set and regarded the remaining samples as the training set. We trained the model on the network built by the training samples. After that, we augmented the network by calculating the correlation between a new sample and other samples when testing the model. We collected the results of all single samples. Other methods only need to select a single sample as a test set and input the rest of the samples into the model as a training set. Tables 5-7 reports the average external evaluation indicators of our method and other methods across all individual samples. We noticed that ERGCN had good performance compared to other methods, with Accuracy values reaching 0.804, 0.850 and 0.824 on the BRCA, GBM and LUNG data sets, respectively.

Survival Analysis
To further explore the relationship of identified subtypes, we conducted a survival analysis on the ERGCN results. Theoretically, different cancer subtypes should exhibit different survival curves. Figure 4 illustrates our model's Kaplan-Meier survival curves on the BRCA, GBM and LUNG datasets, whose abscissa represents time, and the ordinate represents the observed survival rate. We also plotted the median survival time on the curve and calculated the p-value of the log-rank test on the survival curves of different subtypes. On the BRCA data set, the median survival time of the first group was 1699; the second group was 3418 days; the third group was 611 days; the fourth group was NA days. NA means that most patients in the fourth group cannot survive the median survival time. The distance between each category was very long. For GBM, the difference between the subgroups was not very obvious. The median survival time of the first group was 668 days; the second group was NA days; the third group was 327 days. The fourth group was 271 days. On the LUNG data set, the median survival time of cluster 1 was 2082 days; for cluster 2 it was 1415 days; for cluster 3 it was 1088 days, and for cluster 4 it was 306 days. The distance between each category was long. Hence, there was a significant difference in the survival curves of the cancer subtypes identified by our model on the three cancer datasets.

Survival Analysis
To further explore the relationship of identified subtypes, we conducted a survival analysis on the ERGCN results. Theoretically, different cancer subtypes should exhibit different survival curves. Figure 4 illustrates our model's Kaplan-Meier survival curves on the BRCA, GBM and LUNG datasets, whose abscissa represents time, and the ordinate represents the observed survival rate. We also plotted the median survival time on the curve and calculated the p-value of the log-rank test on the survival curves of different subtypes. On the BRCA data set, the median survival time of the first group was 1699; the second group was 3418 days; the third group was 611 days; the fourth group was NA days. NA means that most patients in the fourth group cannot survive the median survival time. The distance between each category was very long. For GBM, the difference between the subgroups was not very obvious. The median survival time of the first group was 668 days; the second group was NA days; the third group was 327 days. The fourth group was 271 days. On the LUNG data set, the median survival time of cluster 1 was 2082 days; for cluster 2 it was 1415 days; for cluster 3 it was 1088 days, and for cluster 4 it was 306 days. The distance between each category was long. Hence, there was a significant difference in the survival curves of the cancer subtypes identified by our model on the three cancer datasets.

Ablation Study
ERGCN combines the GCN model and residual architecture to classify cancer samples. To investigate which parts contributed to ERGCN model's excellent performance, we conducted an ablation study on the BRCA, GBM and LUNG datasets. MLP and GCN are two variations of our model. MLP reduces the dimensionality of cancer gene expression data through a two-layer neural network and then directly makes a classification through the softmax activation function. GCN leverages the same framework as ERGCN without using the residual architecture. The parameter settings of MLP and GCN were the same as those of ERGCN. Table 8 lists the comparison of their

Ablation Study
ERGCN combines the GCN model and residual architecture to classify cancer samples. To investigate which parts contributed to ERGCN model's excellent performance, we conducted an ablation study on the BRCA, GBM and LUNG datasets. MLP and GCN are two variations of our model. MLP reduces the dimensionality of cancer gene expression data through a two-layer neural network and then directly makes a classification through the softmax activation function. GCN leverages the same framework as ERGCN without using the residual architecture. The parameter settings of MLP and GCN were the same as those of ERGCN. Table 8 lists the comparison of their external evaluation metrics on the three datasets. On the BRCA dataset, compared with MPL and GCN, ERGCN yielded a 2.48% and 1.67% improvement to Accuracy values, 4.54% and 3.38% improvement to F1 Score values, and a 3.64% and 2.53% improvement to MCC values. Similar results were observed on the GBM and LUNG data sets. ERGCN also produced higher Accuracy, F1 Score and MCC values than its variations, MPL and GCN, on the two data sets, except a 0.4% lower F1 Score than MPL on the LUNG dataset. The observed improvement in the performance of ERGCN suggests that ERGCN successfully improves cancer subtype classification by conducting a residual graph convolutional operation on the sample similarity network.

Analyzing Key Genes of Breast Cancer Subtypes
In this part, we further probed the key genes that may cause the occurrence of breast cancer subtypes. According to Yang's [16] approach, we leveraged ERGCN and the Random Forest model to determine the key genes of breast cancer subtypes. First, we ran the ERGCN model to predict a cancer subtype label for every sample in the BRCA dataset. Then we regarded the gene expression profile as the features of a sample and fitted a Radom Forest model with the sample features and their labels predicted by ERGCN. The Gini index measured the feature importance. We obtained 50 key genes with the highest Gini index scores which are essential in breast cancer subtype classification. To further investigate the function of these 50 key genes, we compared them with differentially expressed genes and found 37 key genes were differentially expressed. The differentially expressed genes were obtained using the R function TCGAanalyze-DEA, which compares tumor samples and normal solid tissue samples with the parameters dr.cut = 0.01 and logFC.cut = 1. Next, we employed the R package clusterProfiler [44] to perform GO and KEGG enrichment analysis for these 50 key genes (see Figure 5). GO enrichment studies the selected genes from three aspects: biological process (BP), cell composition (CC) and molecular function (MF).
In breast cancer, as for biological process, most of the selected genes were enriched in the regulation of gland development, epithelial cell development, gonad development, mammary gland epithelium development and primary sexual characteristics. For cellular components, the selected genes were mainly concentrated on the chromosomal region, cytoplasmic region and kinetochorer. For molecular function, the chosen genes were mostly located in DNA-binding transcription activator activity, RNA polymerase ll-specific, steroid binding and transcription coregulatori binding. The KEGG pathway analysis illustrated that most of the selected genes were enriched in the progesterone-mediated oocyte maturatiorl, oocyte meiosis, chemical carcinogenesis receptor activator and cellular senescence. expressed genes and found 37 key genes were differentially expressed. The differentially expressed genes were obtained using the R function TCGAanalyze-DEA, which compares tumor samples and normal solid tissue samples with the parameters dr.cut = 0.01 and logFC.cut = 1. Next, we employed the R package clusterProfiler [44] to perform GO and KEGG enrichment analysis for these 50 key genes (see Figure 5). GO enrichment studies the selected genes from three aspects: biological process (BP), cell composition (CC) and molecular function (MF).

Discussion
This study designed a novel deep learning model, ERGCN, to classify cancer subtypes. Our contributions mainly lie in two aspects. One is that we considered the interactions between samples to make predictions. The other is that we introduced residuals to the end-to-end GCN model to avoid over-smoothing and to strengthen the original samples' features. The observed improvement in the performance of ERGCN compared with other existing methods without using sample interactions suggests that interactions between samples contain rich and valuable information for cancer sample classification. Moreover, ERGCN performs best with a fair number of sample interactions (see the determination of correlation coefficient threshold). Our model also performed better than the GCN+PPI model by considering gene associations for cancer subtype classification. ERGCN outperformed two variations of our model, MLP and GCN, which proves that conducting a residual graph convolutional operation on the sample similarity network contributes to the prediction task. We also noticed that the models based on the autoencoder plus classifiers showed relatively lower performance. The two-stage method may not effectively learn features for cancer subtype identification compared with the end-to-end methods. In practice, we want to know which subtype a cancer sample belongs to, and to probe the subtype-related genes. We can combine ERGCN and the Random Forest model to determine the essential genes of a cancer subtype.

Conclusions
This work developed a cancer subtype identification method based on the residual graph convolutional network model. Firstly, we regarded gene expression profiles as sample features and constructed a sample similarity network according to their gene coexpression pattern. Next, we put the network and sample features into a residual graph convolutional network model to obtain the cancer subtype classification. Our method was applied to the three data sets of BRCA, GBM, and LUNG. The results show that our method was significantly better than other existing methods in terms of the internal and external evaluation indicators. We can see the stability of our model through the results of the new sample experiment. The survival of subtypes detected by our model differs considerably. The ablation study showed that the ERGCN combining the GCN model and residual architecture leads to higher performance than all its variants. Moreover, by analyzing the prediction results of breast cancer, we find some key genes that may cause the occurrence of breast cancer subtypes, demonstrating the practicality of ERGCN. However, our model still has some limitations. On the one hand, our model relies on a predefined threshold to construct the sample similarity network. Too high or too low threshold values may affect the performance of our model. However, it is hard to set proper threshold automatically. On the other hand, we only used gene expression profiles to classify samples. Other omics data may provide complementary information for cancer subtype classification [45]. Hence, our future work will design a more practical approach to stratify cancer subtypes by integrating multi-omics data.