Development and Validation of a Seven-Gene Signature for Predicting the Prognosis of Lung Adenocarcinoma

Background. Prognosis is a major concern for patients with lung adenocarcinoma (LUAD), yet no robust and highly effective prognostic model has been developed. This study aimed to construct a stable and practicable gene signature-based model via bioinformatics methods for predicting the prognosis of patients with LUAD. Methods. The mRNA expression data were accessed from the TCGA-LUAD dataset, and paired clinical information was collected from the GDC website. The R package “edgeR” was employed to select the differentially expressed genes (DEGs), which were then used to construct a gene signature-based model via univariate Cox, Lasso, and multivariate Cox regression analyses. Kaplan-Meier survival analysis and ROC analysis were conducted to evaluate the performance of the model in predicting LUAD prognosis, and an independent dataset, GSE26939, was accessed for further validation. Results. In total, 1,655 DEGs were obtained, and a 7-gene signature-based risk score was developed and formulated as risk_score = 0.000245∗NTSR1 + (7.13E − 05)∗RHOV + 0.000505∗KLK8 + (7.01E − 05)∗TNS4 + 0.000288∗C1QTNF6 + 0.00044∗IVL + 0.000161∗B4GALNT2. Kaplan-Meier survival curves revealed that the survival rate of patients in the high-risk group was lower than that of patients in the low-risk group in both the TCGA-LUAD dataset and GSE26939. The relationship between the risk score and clinical characteristics was further investigated, showing that the model was effective in prognosis prediction for patients of different ages (age > 65, age < 65) and TNM stages (N0&N1, T1&T2, and tumor stage I/II). In sum, our study provides a robust predictive model for LUAD prognosis, which supports clinical research on LUAD and helps to explore the mechanism underlying the occurrence and progression of LUAD.


Introduction
Lung cancer is a malignant tumor whose morbidity (13% in both males and females) and mortality (24% in males and 23% in females) rank second and first worldwide, respectively, according to the latest data released in CA: A Cancer Journal for Clinicians [1]. Lung cancer can be classified into small-cell lung cancer (SCLC) and non-small-cell lung cancer (NSCLC), with NSCLC accounting for the majority of lung cancer cases (around 80%). Lung adenocarcinoma (LUAD), the main histological subtype of NSCLC, accounts for over 40% of overall lung cancer morbidity [2]. Given that around 80% of patients with lung cancer are diagnosed in middle and advanced stages, surgery is no longer an available option, resulting in unfavorable outcomes with a 5-year overall survival (OS) rate of nearly 17% [3,4]. Because distant metastasis and relapse are the main causes of poor treatment outcomes and prognosis [5,6], identifying cancer-associated genes and independent prognostic factors and investigating their impact on tumor progression and prognosis benefit the implementation of precision medicine and help to raise the cure rate and improve prognosis. With the development of gene chip technology and RNA sequencing, gene expression profiles have been widely applied in the prediction of LUAD prognosis. For example, PHLPP2 has been reported as a novel biomarker in NSCLC metastasis and prognosis [7]. Thyroid transcription factor-1 is considered a prognostic marker indicating the presence or absence of EGFR-sensitizing mutations in stage IV LUAD [8]. Elevated CX3CL1 mRNA expression has also been found to be a positive factor in LUAD prognosis [9]. However, owing to differences in methods, experimental platforms, batch effects, and other factors, the genes screened for prognosis prediction are inconsistent across studies. Besides, the prognostic models constructed might be practicable only in the samples in which they were developed, while their performance in independent datasets is less pronounced. Therefore, a model that is practicable across datasets is urgently needed so that its value can be realized in different clinical studies.
In the present study, HTSeq-Counts data of LUAD comprising 522 tumor samples and 58 normal samples were accessed from the TCGA database. Based on these data, survival-associated genes were selected using univariate Cox regression analysis, after which a Lasso regression model was constructed to remove genes that were relatively highly correlated and thereby prevent model overfitting. Afterwards, a series of multivariate Cox regression models were established, and the optimal model was identified according to the Akaike Information Criterion (AIC). The performance of the model in predicting LUAD prognosis was then evaluated from several aspects: the model was effective in both the training set and the testing set, and it also performed well in patients of different ages and TNM stages. Furthermore, the model exhibited a good ability to predict the prognosis of LUAD patients in an independent dataset, GSE26939. In summary, our study constructs a robust gene signature-based model for predicting the prognosis of LUAD patients, which supports clinical research on LUAD and lays a foundation for future investigation of the molecular mechanism underlying LUAD occurrence and progression.

Methods and Materials
2.1. Data Collection and Preprocessing. HTSeq-Counts data of LUAD (including 522 tumor samples and 58 normal samples) were obtained from the TCGA database (https://portal.gdc.cancer.gov/) and then used for differential analysis with the R package "edgeR" (|logFC| ≥ 2, FDR < 0.05). The corresponding clinical information of TCGA-LUAD patients was collected from the GDC website (https://portal.gdc.cancer.gov/). Patients who were followed up for less than 30 days were excluded from this study, and 460 TCGA-LUAD patients were eventually included. Besides, to further verify the validity of the prognostic model, an independent dataset, GSE26939 (including 115 patients with LUAD), and matched clinical information were accessed from the GEO database (https://www.ncbi.nlm.nih.gov/geo/).
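The two filters described above (the DEG thresholds and the follow-up exclusion) can be illustrated with a minimal Python sketch. It is not the edgeR workflow itself: the `GeneResult` layout, the function names, and all toy values are assumptions introduced here for illustration, and only the thresholds (|logFC| ≥ 2, FDR < 0.05, follow-up of at least 30 days) come from the text.

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    """One row of a differential-expression table (hypothetical layout)."""
    gene: str
    log_fc: float  # log fold change, tumor vs. normal
    fdr: float     # multiple-testing-adjusted p value


def select_degs(results, min_abs_log_fc=2.0, max_fdr=0.05):
    """Keep genes with |logFC| >= 2 and FDR < 0.05, the thresholds stated above."""
    return [r.gene for r in results
            if abs(r.log_fc) >= min_abs_log_fc and r.fdr < max_fdr]


def keep_patients(follow_up_days, min_days=30):
    """Drop patients followed up for less than 30 days."""
    return {pid: d for pid, d in follow_up_days.items() if d >= min_days}


# Toy values for illustration only; they are not taken from TCGA-LUAD.
toy_results = [
    GeneResult("GENE_A", 3.1, 0.001),   # kept
    GeneResult("GENE_B", -2.4, 0.020),  # kept (strongly downregulated)
    GeneResult("GENE_C", 1.5, 0.001),   # dropped: |logFC| < 2
    GeneResult("GENE_D", 2.8, 0.200),   # dropped: FDR >= 0.05
]
toy_follow_up = {"P1": 12, "P2": 30, "P3": 845}

degs = select_degs(toy_results)
patients = keep_patients(toy_follow_up)
print(degs, sorted(patients))
```

Whether a follow-up of exactly 30 days is retained is an assumption of this sketch; the text only states that shorter follow-ups were excluded.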

2.2. Candidate Gene Selection. TCGA-LUAD patients, characterized by the expression of the differentially expressed genes (DEGs) screened out by "edgeR", were randomized into a training set and a testing set (5 : 5). Univariate Cox regression analysis was then performed to identify the genes associated with the survival of patients with LUAD. A Lasso regression model was employed to further analyze these survival-related genes and exclude those with a relatively high correlation, which decreases the complexity of the prognostic model [10] and helps to find the optimal signature genes.
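A 5 : 5 random split can be sketched as follows, assuming the split is applied to patients (as the later grouping of patients in each set suggests). The seed, the placeholder IDs, and the function name `split_half` are hypothetical; the source does not report how the randomization was implemented.

```python
import random


def split_half(patient_ids, seed=0):
    """Shuffle patient IDs reproducibly and split them 5:5."""
    ids = sorted(patient_ids)          # fixed starting order
    random.Random(seed).shuffle(ids)   # seeded shuffle; the seed is arbitrary
    half = len(ids) // 2
    return ids[:half], ids[half:]


# 460 placeholder IDs stand in for the 460 retained TCGA-LUAD patients.
patient_ids = [f"patient_{i:03d}" for i in range(460)]
training_set, testing_set = split_half(patient_ids)
print(len(training_set), len(testing_set))  # 230 230
```

Fixing the seed makes the partition reproducible, which matters because the genes selected downstream depend on which patients fall in the training set.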
2.3. Prognostic Model Construction. Candidate genes selected by Lasso regression analysis were used to construct multivariate Cox models, and the Akaike Information Criterion (AIC) was used to identify the optimal prognostic model.
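As a worked illustration of AIC-based selection, the sketch below computes AIC = 2k − 2 ln(L) for several candidate models and keeps the one with the lowest value. For a Cox model, the partial log-likelihood typically takes the place of ln(L). The candidate models and their log-likelihoods are hypothetical numbers chosen for illustration; they are not results from this study.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical (gene count, partial log-likelihood) pairs; not from the paper.
candidates = {
    "9-gene model": (9, -410.2),
    "8-gene model": (8, -410.5),
    "7-gene model": (7, -410.9),
    "6-gene model": (6, -414.0),
}

scores = {name: aic(loglik, k) for name, (k, loglik) in candidates.items()}
best = min(scores, key=scores.get)  # lower AIC is preferred
print(best)
```

In this toy example, dropping two genes costs little likelihood, so the penalty term 2k favors the smaller model, whereas dropping a third gene reduces the likelihood enough to raise the AIC again.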

2.4. Stability and Validity Verification. Patients in the training set and testing set were each assigned a risk score and grouped into a high-risk group and a low-risk group based on the median score. The Kaplan-Meier method was used to compare the survival of patients in the two groups, and the log-rank test was performed to calculate the p value. Meanwhile, ROC analysis was carried out to assess the performance of the model in predicting the prognosis of LUAD patients, and an independent dataset, GSE26939, was used to verify the model's validity.
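The grouping and survival estimation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the software used in the study: the function names and toy data are hypothetical, and assigning scores equal to the median to the low-risk group is a choice made here that the text does not specify.

```python
from statistics import median


def split_by_median(risk_scores):
    """Label each patient 'high' or 'low' relative to the median risk score."""
    cutoff = median(risk_scores.values())
    return {pid: ("high" if s > cutoff else "low")
            for pid, s in risk_scores.items()}


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the patient died at times[i] and 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:  # group tied times
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # deaths and censored patients leave the risk set
    return curve


# Toy data for illustration only.
toy_scores = {"P1": 0.2, "P2": 0.9, "P3": 0.5, "P4": 1.4}
groups = split_by_median(toy_scores)
curve = kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1])
print(groups)
print(curve)
```

In practice, one curve would be estimated per risk group and the two curves compared with a log-rank test, which this sketch omits.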

Identification of Candidate Genes.
In total, 1,655 DEGs were obtained via differential analysis based on the TCGA-LUAD dataset (Figure 1(a)), and the patients were randomly assigned to the training set and testing set (5 : 5). Univariate Cox analysis was performed to screen survival-related genes in the training set with a p value cut-off of 0.01, and 60 genes were initially screened out, as shown in Supplementary Table 1 (the top 20 genes associated with survival are listed in Table 1). Subsequently, these genes were analyzed in a Lasso regression model. Genes with a relatively high correlation were removed to lower the complexity of the prognostic model, and finally, 9 candidate signature genes were identified, namely, NTSR1, RHOV, KLK8, TNS4, C1QTNF6, FAM83A, IVL, B4GALNT2, and CREG2 (Figures 1(b) and 1(c)).

Evaluation of the 7-Gene Signature-Based Model in Predicting the Survival of LUAD Patients.
Based on the formula, the 7-gene signature-based risk score of each patient in the training set and testing set was calculated, and patients were classified into the high-risk group and the low-risk      (Figure 2(j)), respectively. Taken together, the 7-gene signature-based model was demonstrated to be capable of predicting the prognosis of LUAD patients.
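For concreteness, the risk score is a weighted sum of the expression values of the seven genes, with the coefficients reported in the abstract. The sketch below evaluates it for one patient. The coefficients come from the paper; the function name, the input layout, and the toy expression values are assumptions, and the expression scale to which the coefficients apply is not restated here.

```python
# Coefficients of the 7-gene risk score as reported in the paper.
COEFFICIENTS = {
    "NTSR1": 0.000245,
    "RHOV": 7.13e-05,
    "KLK8": 0.000505,
    "TNS4": 7.01e-05,
    "C1QTNF6": 0.000288,
    "IVL": 0.00044,
    "B4GALNT2": 0.000161,
}


def risk_score(expression):
    """Weighted sum of the seven genes' expression values for one patient."""
    missing = COEFFICIENTS.keys() - expression.keys()
    if missing:
        raise KeyError(f"missing genes: {sorted(missing)}")
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


# Toy patient with every gene at 1000 (illustrative values, not real data).
toy_patient = {gene: 1000.0 for gene in COEFFICIENTS}
score = risk_score(toy_patient)
print(round(score, 4))
```

Because every coefficient is positive, higher expression of any of the seven genes raises the score, which is consistent with the high-risk group having poorer survival.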

Verification of Stability and Validity of the Prognostic Model for LUAD with an Independent Dataset GSE26939.
An independent dataset, GSE26939, from the GEO database was used to further verify the validity and stability of the 7-gene model. Following the same procedure as above, patients were divided into high-risk and low-risk groups based on the median risk score, and survival was compared using the Kaplan-Meier method (Figure 3(a)), which indicated a lower survival rate in the high-risk group (p < 0.05). Thereafter, ROC analysis was performed for further verification, yielding AUC values of 0.667, 0.616, and 0.623 for 1-, 3-, and 5-year survival, respectively (Figure 3(b)). Collectively, this 7-gene model was practicable in an independent dataset.
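To make the 1-, 3-, and 5-year AUC values easier to interpret, the sketch below computes a simplified AUC at a fixed time horizon: the probability that a patient who died by the horizon has a higher risk score than a patient still alive after it. This is an assumption-laden simplification and not necessarily the method used in the study: patients censored before the horizon are simply excluded, whereas dedicated time-dependent ROC methods weight them. All names and toy values are hypothetical.

```python
def auc_at_horizon(times, events, scores, horizon):
    """Simplified AUC at a time horizon (same unit as `times`).

    Cases: patients who died at or before the horizon.
    Controls: patients known to be alive after the horizon.
    Patients censored before the horizon are excluded (a simplification).
    """
    cases = [s for t, e, s in zip(times, events, scores)
             if e == 1 and t <= horizon]
    controls = [s for t, s in zip(times, scores) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    # Count case/control pairs ranked correctly; ties count as half.
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy cohort for illustration only (times in years).
times = [0.5, 2.0, 0.8, 4.0, 6.0, 6.0]
events = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.8, 0.2, 0.3]

auc_1y = auc_at_horizon(times, events, scores, horizon=1)
auc_3y = auc_at_horizon(times, events, scores, horizon=3)
print(auc_1y, round(auc_3y, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so values between 0.6 and 0.7, as reported for GSE26939, indicate modest discrimination.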

Prognostic Impact of the Model on Clinical Characteristics.
To further examine the correlation of the 7-gene signature-based risk score with the TNM (Tumor Node Metastasis) stage and overall survival (OS) of LUAD patients, matched clinical information of the training set and testing set was collected and is listed in Tables 3 and 4. The relationship between the risk score and TNM stage was explored, revealing that the risk score was significantly associated with the pathologic T, N, and tumor stages of patients in both the training set (Figures 4(a)-4(c)) and the testing set (Figures 4(d)-4(f)) (p < 0.05). Moreover, the performance of the model in predicting the prognosis of patients with different clinical characteristics in the two sets was investigated (Figure 4(g)), showing good performance in patients of different ages and clinical stages (age > 65, age < 65, N0&N1, T1&T2, and tumor stage I/II). In the independent dataset GSE26939, however, this correlation was less pronounced (Supplementary Table 2). Altogether, this 7-gene signature-based risk score model was a useful prognosis predictor in patients with different clinical characteristics and could serve as a novel biomarker in LUAD treatment.

Discussion
Lung cancer, which has the highest mortality of any cancer globally, is in the middle or advanced stage at initial diagnosis in most patients, and surgery is no more   [11]. Park et al. tried to construct a gene signature-based prognostic model for LUAD [12], and in 2016, Shukla et al. proposed the first RNA-seq-based prognostic signature by analyzing RNA-seq and clinical data, in an attempt to develop a potent predictive tool for LUAD prognosis [13]. Despite extensive research on signature genes for LUAD prognosis, models with robust predictive capability have yet to be constructed. Besides, with the development of high-throughput sequencing, more gene datasets of LUAD should be incorporated into new studies. In our study, seven LUAD survival-related genes were identified: NTSR1, RHOV, KLK8, TNS4, C1QTNF6, IVL, and B4GALNT2. These 7 signature genes were obtained from the HTSeq-Counts data in the TCGA-LUAD dataset using univariate Cox, Lasso, and multivariate Cox regression analyses. Subsequently, the risk score based on the 7-gene signature was established and formulated as risk_score = 0.000245∗NTSR1 + (7.13E − 05)∗RHOV + 0.000505∗KLK8 + (7.01E − 05)∗TNS4 + 0.000288∗C1QTNF6 + 0.00044∗IVL + 0.000161∗B4GALNT2. As reported, most of these 7 genes are closely related to cancer progression. For example, NTSR1 (Neurotensin Receptor 1) has been reported as a potential prognostic biomarker for surgically resected stage I LUAD [14] and prostate cancer [15]. RHOV (Ras Homolog Family Member V) has been verified to be highly expressed in NSCLC and can serve as a signature gene in LUAD prognosis [16]. KLK8 (Kallikrein Related Peptidase 8) has shown research value in the prognosis of various cancers, such as lung cancer [17], ovarian cancer [18], breast cancer [19], colon cancer, and rectal cancer [20]. Moreover, TNS4 (Tensin 4) has been found to be upregulated in LUAD and able to predict poor prognosis, and it has been observed to be mediated by miR-150-3p [21]. Meanwhile, another study indicated that the aberrant methylation of TNS4 is significantly associated with the OS of LUAD patients [22]. C1QTNF6 (C1q/tumor necrosis factor-related protein 6), a member of the CTRP family, has shown potential as an independent predictor of prognosis in patients with LUAD [23]. Additionally, although the role of B4GALNT2 (Beta-1,4-N-Acetyl-Galactosaminyltransferase 2) in LUAD has not been investigated, it has been observed to be highly related to gastric cancer metastasis [24]. However, the association between IVL (Involucrin) and the progression of LUAD has not been reported and requires further study. Taken together, these studies suggest that some of these signature genes are also related to the prognosis of other cancers.
In this study, each patient in the training set and testing set was assigned a risk score and classified into the high-risk group or the low-risk group according to the median score. As the OS curves suggested, patients in the high-risk group had poorer survival. ROC curves were plotted, and the AUC values for 1-, 3-, and 5-year survival in the two sets were all above 0.6, indicating that the 7-gene signature-based risk score model was capable of predicting LUAD prognosis. Notably, similar results were found in an independent dataset, GSE26939, from the GEO database, demonstrating the validity and practicality of this 7-gene model. Furthermore, the association between this model and the clinical characteristics of LUAD patients was explored, showing that the model functioned well in predicting the prognosis of patients of different ages (age > 65, age < 65) and TNM stages (N0&N1, T1&T2, and tumor stage I/II), while the effect in GSE26939 was less remarkable.
BioMed Research International

Conclusions
In conclusion, we obtained 1,655 DEGs from the TCGA-LUAD dataset using the "edgeR" package and constructed a prognostic 7-gene signature-based model (comprising NTSR1, RHOV, KLK8, TNS4, C1QTNF6, IVL, and B4GALNT2) through univariate Cox, Lasso, and multivariate Cox regression analyses. The robust model we built helps to advance clinical research on LUAD and to better understand the mechanism underlying LUAD occurrence and progression.

Data Availability
All the data in this manuscript are available.

Conflicts of Interest
The authors declare no conflicts of interest.