A Signature of 14 Long Non-Coding RNAs (lncRNAs) as a Step towards Precision Diagnosis for NSCLC

Simple Summary Although the biological function of lncRNAs has not been fully elucidated, we know that the aberrant expression of lncRNAs can drive the cancer phenotype. Therefore, a growing area of research is focusing on lncRNAs as putative diagnostic biomarkers and therapeutic targets. The aim of the study was the appraisal of the diagnostic value of 14 differentially expressed lncRNA in the early stages of NSCLC. We established two classifiers. The first recognized cancerous from noncancerous tissues, the second successfully discriminated NSCLC subtypes (LUAD vs. LUSC). Our results indicate that the panel of 14 lncRNAs can be a promising tool to support a routine histopathological diagnosis of NSCLC. Abstract LncRNAs have arisen as new players in the world of non-coding RNA. Disrupted expression of these molecules can be tightly linked to the onset, promotion and progression of cancer. The present study estimated the usefulness of 14 lncRNAs (HAGLR, ADAMTS9-AS2, LINC00261, MCM3AP-AS1, TP53TG1, C14orf132, LINC00968, LINC00312, TP73-AS1, LOC344887, LINC00673, SOX2-OT, AFAP1-AS1, LOC730101) for early detection of non-small-cell lung cancer (NSCLC). The total RNA was isolated from paired fresh-frozen cancerous and noncancerous lung tissue from 92 NSCLC patients diagnosed with either adenocarcinoma (LUAD) or lung squamous cell carcinoma (LUSC). The expression level of lncRNAs was evaluated by a quantitative real-time PCR (qPCR). Based on Ct and delta Ct values, logistic regression and gradient boosting decision tree classifiers were built. The latter is a novel, advanced machine learning algorithm with great potential in medical science. The established predictive models showed that a set of 14 lncRNAs accurately discriminates cancerous from noncancerous lung tissues (AUC value of 0.98 ± 0.01) and NSCLC subtypes (AUC value of 0.84 ± 0.09), although the expression of a few molecules was statistically insignificant (SOX2-OT, AFAP1-AS1 and LOC730101 for tumor vs. normal tissue; and TP53TG1, C14orf132, LINC00968 and LOC730101 for LUAD vs. LUSC). However for subtypes discrimination, the simplified logistic regression model based on the four variables (delta Ct AFAP1-AS1, Ct SOX2-OT, Ct LINC00261, and delta Ct LINC00673) had even stronger diagnostic potential than the original one (AUC value of 0.88 ± 0.07). Our results demonstrate that the 14 lncRNA signature can be an auxiliary tool to endorse and complement the histological diagnosis of non-small-cell lung cancer.

It is also believed that the development of the robust diagnostic DElncRNA signature could support the routine histopathological examination of patients' samples. Lin Y et al. [16] showed that two circulating lncRNAs: SNHG1 and RMRP distinguished lung cancer patients from cancer-free controls. Similarly, Yuan S et al. [17] constructed a panel of four circulating lncRNA (RMRP, NEAT1, TUG1, and MALAT1) that improved the diagnostic capacity in both LUAD and LUSC.
Despite advances in the understanding of the cancer genome, a great number of lncRNAs remain "the dark matter". The biological functions of lncRNAs and underlying molecular mechanisms have not been fully revealed. Furthermore, there is a lack of strong evidence advocating that lncRNAs can be translated from bench to bedside [4]. To provide novel insight to precision diagnosis, we evaluated the expression of 14 lncRNAs (HAGLR, ADAMTS9-AS2, LINC00261, MCM3AP-AS1, TP53TG1, C14orf132, LINC00968, LINC00312, TP73-AS1, LOC344887, LINC00673, SOX2-OT, AFAP1-AS1, and LOC730101) in the early stages of non-small-cell lung cancer (NSCLC). Based on the PubMed database, we chose lncRNAs that (1) were shown to be differentially expressed in NSCLC; (2) were detected to be expressed in cell lines or tissues of NSCLC; (3) were noticed to have a potential role in the pathogenesis; and (4) were shown to be expressed by the established molecular methods (e.g., microarray, quantitative PCR). The expression level of selected molecules was measured by a quantitative polymerase chain reaction (qPCR). Then, the acquired Ct and delta Ct values were used to build a logistic regression classifier that successfully discriminated tumor from non-tumor tissue. However, in a more complex task of distinguishing NSCLC subtypes, a gradient boosting decision tree classifier (GBDT) was used to create the final simplified regression model based on four lncRNAs (AFAP1-AS1, SOX2-OT, LINC00261, and LINC00673) that was found to be a robust predictor of NSCLC subtype (AUC value of 0.88 ± 0.07).

Patients and Samples
The study was conducted, in the frame of the Polish STRATEGMED-2 and MINIATURA-2 projects, on 92 paired snap-frozen tumors and matched unaffected lung tissue collected from the lungs of IA-IIB NSCLC patients in the Department of Thoracic Surgery Medical University of Bialystok and in the Department of Thoracic Surgery Poznan University of Medical Sciences. None of the patients was treated with chemo-or radiotherapy before surgery. The informed consent for specimen collection and clinicopathological data processing was signed. The study design was approved by the Bioethics Committee of the Medical University of Bialystok (ethical approval codes: R-I-002/357/2014 and R-I-002/343/2018). Prior to RNA extraction, the cross-sections of frozen tissue samples were stained with hematoxylin-eosin and evaluated by a pathologist (J.R.) to confirm the content of cancer cells. Tumor specimens with a high percentage of the malignant cells (above 70% of tumor cells on a microscopic section) and normal lung epithelium without metaplasia or dysplasia were used for downstream application (tumor tissue and normal tissue, respectively).

RNA Extraction and Quantitative Real-Time PCR
Total RNA, including the fraction of lncRNA, was isolated from tumor and adjacent normal lung tissue using the mirVana™ PARIS™ Kit (Thermo Fisher Scientific, Carlsbad, CA, USA), according to the manufacturer's protocol. The method comprises an organic extraction followed by immobilization of RNA on glass-fiber filters. RNA concentration was measured on NanoDrop 2000c (Thermo Fisher Scientific, Carlsbad, CA, USA). RNA Integrity Number (RIN) was evaluated on a 2100 Bioanalyzer (Agilent, Santa Clara, CA, USA). The samples with RIN below six were automatically excluded. TR2 First Strand Kit (Qiagen, Germantown, MD, USA) was used for cDNA synthesis and for elimination of genomic DNA from RNA elutes, following producer's recommendations. The level of expression of 14 lncRNAs and a reference gene (GAPDH) was accessed by quantitative Real-Time PCR (qPCR). The assays were designed against the combined NCBI RefSeq and Ensembl  Table 1. The amplification was performed in triplicates with LightCycler 480 (Roche, Basel, Switzerland) according to RT2 lncRNA qPCR Assays protocol (Qiagen, Germantown, MD, USA). To increase the accuracy and repeatability of pipetting, Agilent Bravo Automated Liquid Handling Platform (Agilent, Santa Clara, CA, USA) was applied. The variation between qPCR runs was compensated by the application of the interplate calibrator (IPC). Raw data (Ct), after IPC correction, was normalized according to the formula: dCt = Ct lncRNA gene-Ct reference gene (GAPDH). Relative quantification (RQ) was calculated as follows: 2ˆ(−ddCt) where: ddCt = dCt tumor-dCt normal.

Statistical Analysis
The differences in lncRNA expressions between the tumor and unaffected lung tissues were analyzed with the Wilcoxon matched-pairs signed-ranks test. The differences in lncRNA expression between NSCLC histological subtypes (LUAD tumor tissue vs. LUSC tumor tissue) were calculated with the Kruskal-Wallis rank tests. The Wilcoxon rank-sum (Mann-Whitney U) or Kruskal-Wallis rank tests were used to analyze the associations between clinicopathological variables and lncRNAs expression. After the Bonferroni correction, p ≤ 0.003 was considered to indicate statistically significant differences. The statistical analyses were performed using STATA/SE 11.1 software (Stata Corporation, College Station, TX, USA).
Two binary logistic regression classifiers were established to appraise the diagnostic value of 14 lncRNAs. The first was created to differentiate cancerous and noncancerous samples. The latter has focused on the discrimination of LUSC tumor tissue from LUAD tumor tissue. In both models, as an input, Ct and delta Ct values were used. The models were evaluated with a fivefold stratified cross-validation. Samples were randomly assigned to the training sets and test sets. In each set, the classes were balanced. The random assignations to sets were repeated 100 times in total. For each step of cross-validation, the accuracy, recall, f1, a receiver operating characteristic (ROC) curve, and the area under the ROC (AUC) were established and results were expressed as mean and 95% CI.  In order to validate the logistic regression metrics for discriminating between LUSC and LUAD, a more flexible gradient boosting decision tree classifier was built. The model was evaluated accordingly and then analyzed with Shapley Additive exPlanations (SHAP) [35,36] to quantify the importance of specific Ct and delta Ct values on the model's predictions. Finally, the four most important variables were used as an input to establish a simplified logistic regression classifier, which was an improved version of the original model. The following Python packages were used in the statistical analysis: scikit-learn v1.0.1 [37] for the logistic regression classifier and model evaluation, lightgbm v3.3.1 [38] for the gradient boosting decision tree classifier, shap v0.40.0 [36], and dalex v1.4.1 [39] for the explanatory model analysis.

Correlation of lncRNAs Expression and Clinicopathological Characteristics of NSCLC Patients
The expression of 13 out of 14 lncRNAs was linked with the histological subtype of NSCLC. The majority of the studied lncRNAs have not shown any significant connection with the other clinicopathological features of patients. We observed that the expression of 3 out of 14 lncRNAs (LINC00673, MCM3AP-AS1, C14orf132) was associated with sex, the expression of 3 out of 14 lncRNAs (LOC344887, MCM3AP-AS1, LOC730101) was connected with death, and the expression of SOX2-OT was linked with the smoking status. Recurrence-free survival (RFS) and TNM have not been affected by the expression of any of 14 lncRNAs (Table 5).

Diagnostic Value of lncRNAs in NSCLC Patients Based on Logistic Regression Classifiers and Gradient Boosting Decision Tree
The first logistic regression model for cancerous and noncancerous tissue separation has shown a strong diagnostic potential (AUC value of 0.98 ± 0.01) (Figure 2A). All of the mean model metrics were above 0.9, indicating that lncRNAs successfully distinguished NSCLC tumor tissue from noncancerous lung tissue ( Figure 2B). In the second logistic regression classifier for subtype discrimination, the mean area under the ROC curve (AUC) was 0.84 ± 0.09 ( Figure 2C). All mean metrics were weaker than in the first model, but still above 0.75, suggesting that 14 lncRNAs taken together can potentially distinguish LUSC from LUAD tissue ( Figure 2D). . Metrics for logistic regression classifier for differentiating normal tissue from tumor tissue and LUSC tissue from LUAD tissue: (A) Mean ROC ± SD curve, and AUC for logistic regression classifier for differentiating normal tissue from cancer tissue; (B) Mean and 95% CI for precision, recall, f1-score, and AUC metrics for logistic regression classifier for differentiating normal tissue from cancer tissue; (C) Mean ROC ± SD curve, and mean AUC for logistic regression classifier for differentiating LUSC tissue from LUAD tissue; (D) Mean and 95% CI of precision, recall, f1-score metrics, and AUC for logistic regression classifier for differentiating LUSC tissue from LUAD tissue.

Figure 2.
Metrics for logistic regression classifier for differentiating normal tissue from tumor tissue and LUSC tissue from LUAD tissue: (A) Mean ROC ± SD curve, and AUC for logistic regression classifier for differentiating normal tissue from cancer tissue; (B) Mean and 95% CI for precision, recall, f1-score, and AUC metrics for logistic regression classifier for differentiating normal tissue from cancer tissue; (C) Mean ROC ± SD curve, and mean AUC for logistic regression classifier for differentiating LUSC tissue from LUAD tissue; (D) Mean and 95% CI of precision, recall, f1-score metrics, and AUC for logistic regression classifier for differentiating LUSC tissue from LUAD tissue.
Improving the second logistic regression model a gradient boosting decision tree (GBDT) was created. This increased the AUC value up to 0.88 ± 0.08 with other mean metrics above 0.75 ( Figure 3A,B). The better results were attributed to capturing the interactions between the variables and dealing better with the higher volume of noise in data ( Figure 1B). Explanatory analysis of GBDT with SHAP highlighted the significance of variables for discriminating between LUSC and LUAD ( Figure 4). From the explanation, it can be observed that the three most important variables used in the discrimination were delta Ct AFAP1-AS1, Ct SOX2-OT, and Ct LINC00261, while, for example, the expression of LINC00312, HAGLR, and ADAMTS9-AS2 were of little use to the model. Moreover, the expression profile of the most important lncRNAs, indicated with the red-to-blue and blueto-red distinction, has been consistent with the results presented in Table 4 and Figure 1. The final simplified logistic regression model was built using only the top-four variables ( Figure 3C) and showed the best diagnostic potential with the AUC value of 0.88 ± 0.07 and other mean metrics above 0.8 ( Figure 3B).
Our findings pinpointed that the panel of 14 lncRNAs can be a novel and useful tool for both discrimination malignant from non-malignant lung tissue and for NSCLC subtyping.

Discussion
The focus of the study was on the possibility of the application of lncRNAs for precise diagnostic decision-making in NSCLC patients. We anticipated that the panel of 14 lncRNAs could distinguish between cancerous and noncancerous lung tissues as well as NSCLC subtypes (LUAD vs. LUSC). Therefore, we hypothesized that our lncRNA signature could be implemented along with histopathology to increase the accuracy of NSCLC diagnosis.
The under-expressed ADAMTS9-AS2, LINC00261, and LINC00312 in our studies were in concordance with the publications underlining their tumor-suppressive potential. The experimentally induced overexpression of these molecules inhibited proliferation, invasion, and migration of tumor cells. Liu C et al. [19] augmented expression of ADAMTS9-AS2 that narrowed the size of lung tumor, arrested lung cancer cells in G1/G0 phase and promoted apoptosis. Moreover, ADAMTS9-AS2 inhibited miR-223-3p and activated TGFBR3. Guo C et al. [22] induced LINC00261 expression that regulated miR-1269a/FOXO1 axis. By the same token, Wang Z et al. [23] showed that LINC00261 suppressed miR-105/FHL1 axis, Shi J et al. [24] demonstrated that the molecule sponged miR-522-3p leading to the overexpression of SFRP2 that inhibited Wnt signaling and Liao J et al. [47] showed that LINC00261 restrained the epithelial-mesenchymal transition (EMT) of NSCLC via downregulating Snail. For the next molecule, LINC00312, the data were less straightforward. Tan Q et al. [48] presented that downregulation of LINC00312 was connected with advanced stage NSCLC, whereas, Tian Z et al. [49] concluded that it was linked with the I stage of LUAD.
We showed that C14orf132 and LINC00968 were downregulated in tumors (NSCLC, LUAD and LUSC) in comparison to matched normal lung tissue and not statistically significant in LUAD tumor vs. LUSC tumor. Yu H et al. [1] identified C14orf132 and LINC00968 as two of the most downregulated lncRNAs in NSCLC compared with normal lung tissue. The biological function of C14orf132 has not yet been revealed. LINC00968, on the other hand, was shown to be a part of a two-signaling axis in LUAD: miR-9-5p/CPEB3 [28] or miR-21-5p/SMAD7 [29]. The molecule also networked three miRNAs (miR-9, miR-22 and miR-4536) and two hub genes (PLK1 and XPO1) [50]. In LUSC, LINC00968 affected the outcome of patients via regulating the MAPK signaling pathway [51].
In our experiment, MCM3AP-AS1 and TP53TG1 were downregulated in NSCLC and LUSC tumors in comparison to normal lung tissue. However, in LUAD tumor vs. LUSC tumor, MCM3AP-AS1 was upregulated, while TP53TG1 did not show any statistically significant differences. Li X et al. [25] and Shen D et al. [26] noticed the elevated expression of MCM3AP-AS1 in NSCLC that accelerated cancer progression. The molecule targeted either miR-340-5p/KPNA4 axis or miR-195-5p/E2F3 axis. However, Xiao H et al. [27] detected the depleted expression of TP53TG1 in NSCLC and cell lines. The experimentally induced upregulation of TP53TG1 modulated the miR-18a/PTEN axis and enhanced cisplatin sensitivity and apoptosis of cancer cells.
In our study, LINC00673 was upregulated in tumors of all considered groups (NSCLC, LUSC, and LUAD) in comparison to adjacent noncancerous lung tissue as well as upregulated in LUAD tumor vs. LUSC tumors. The results overlapped with those previously published. LINC00673, a key regulator of signaling pathways, was shown to promote invasion and migration of NSCLC cells [52][53][54]. Ma C et al. [52] described that LINC00673 participated in epigenetic silencing of HOXA5 via binding EZH2. Lu W et al. [53] observed that the molecule sponging miR-150-5p modulated the expression of ZEB1, a key regulator of EMT, indirectly. Guan H et al. [54] indicated that LINC00673 led to activation of WNT/β-catenin signaling and consequently promoted aggressiveness of LUAD.
In our findings, LOC344887 was upregulated in NSCLC tumor and LUSC tumor compared with normal lung tissue and in LUSC tumor vs. LUAD tumor. The differences in the expression level between LUAD and normal lung tissue were not significant. Similarly considering SOX2-OT, we have not noticed significant differences in both NSCLC and LUAD tumor vs. normal lung tissue. However, the upregulation of SOX2-OT was statistically significant in LUSC tumor vs. both normal lung tissue and LUAD tumor (p < 0.003). In the literature, both LOC344887 and SOX2-OT were shown to be upregulated in NSCLC tissues compared with normal lung tissues. This state was connected with adverse outcomes (lymph node metastasis, advanced stage, and poorer differentiation) [20,21]. Wu B et al. [20] observed the increased expression of LOC344887 in NSCLC compared with adjacent normal tissue, which was connected with shorter overall survival time of patients. Chen Z et al. [55] presented that SOX2-OT modulated miR-30d-5p/PDK1 axis, which promoted proliferation, migration, and invasion of NSCLC cells. Herrera-Solorio AM et al. [21] noticed that SOX2-OT affected the expression of EGFR-pathway members AKT/ERK, which was linked with a poor clinical prognosis.
In our studies, the expression of AFAP1-AS1 in the tumor vs. normal lung tissue was not statistically significant in NSCLC, downregulated in LUSC, and upregulated in LUAD. As far as only tumor tissues were concerned, AFAP1-AS1 was upregulated in LUAD vs. LUSC. According to the literature, AFAP1-AS1 was upregulated in both NSCLC and LUAD compared with matched non-tumor tissues. The overexpression of the gene was an unfavorable factor for patients [56,57]. Zhong Y et al. [58] experimentally confirmed that AFAP1-AS1 promoted lung cancer metastasis. The molecule interacted with SNIP1, inhibiting the degradation of c-Myc protein that activated the expression of ZEB1, ZEB2, and SNAIL. The downstream targets of c-Myc enhanced EMT and consequently the metastatic potential of the cells. Other studies have shown that AFAP1-AS1 repressed epigenetically of p21 [59] and HBP1 [60] and upregulated IRF7 and the RIG-I-like receptor signaling pathways [61]. In addition, Huang N et al. [62] observed that AFAP1-AS1 promoted cellular chemotherapy resistance indirectly. The molecule sponged miR-139-5p, leading to the upregulation of RRM2 that activated EGFR/AKT. Not only was the gene involved in epigenetic regulation, but also its own expression was silenced by DNA methylation [63].
Within 14 evaluated lncRNAs, only LOC730101 expression was not statistically significant in all of the studied groups. Liu L et al. [34] reported that the overexpression of LOC730101 increased the proliferative phenotype of lung cancer cells and was posi-tively correlated with CCND1 and CCNE1, the downstream targets of the Wnt/b-catenin signaling pathway.
The discrepancies between the literature and our study may arise from the tumor biology and different researchers' approaches to the experiment layout. Lung cancer is highly heterogeneous, with the presence of inflammatory changes in the non-neoplastic surroundings [64]. Individual and population features also influence the tumor. To minimize the impact of variabilities on our studies, pathologically assessed fresh-frozen tissues were used. The samples enriched in malignant cells (ranged from 70 to 100%) were classified as "tumor" whereas those without tumor cells were classified as "normal." In contrast to our experiment, the majority of studies utilized cell cultures, a homogenous population deprived of influences from the environment, which therefore does not directly mirror the dynamics of genetic changes observed inside the organs and tissues undergoing cancer genesis [19,42,44,45]. Another aspect that characterized our research was the exclusion of undifferentiated large cell carcinoma (LCLC) cases, due to low prevalence in the population (2.9-9%) and difficulties in obtaining a representative group [65]. There is no doubt that the pattern of gene expression in LCLC may be different from those in LUAD and LUSC and change the overall picture of gene expression in NSCLC.
A further source of inconsistencies could be associated with the restriction of our experiment to the early stages of NSCLC (IA-IIB), whereas others extended their interest to more advanced stages (III-IV) [1,41]. In the course of tumorigenesis, the gene expression changes. LncRNAs with oncogenic potential are more than likely to be downregulated in the early stages and upregulated as the disease progresses and local or distant metastases occur. That is why our results might not be in line with those obtained by researchers considering different stages of NSCLC.
The two binary regression classifiers established by us, based on all 14 lncRNAs, distinguished cancerous from noncancerous lung tissues and NSCLC subtypes (AUC of 0.98 ± 0.01 and 0.84 ± 0.09, respectively) with high precision, even if the expression of a few molecules was statistically insignificant (SOX2-OT, AFAP1-AS1, and LOC730101 for tumor vs. normal tissue; and TP53TG1, C14orf132, LINC00968, and LOC730101 for LUAD vs. LUSC). To expand our knowledge on the importance of specific lncRNAs toward discriminating LUAD from LUSC tumor types, Shapley Additive Explanation of a gradient boosting decision tree was used. This one-step procedure of variable selection explaining a more complex classifier allowed us to build a very simple logistic regression model of high performance (AUC of 0.88 ± 0.07). The popularity of tree-based models, and especially interpreting their predictions, has seen a major impact across the medical domain [35]. The visible advantage of this approach presented in our study was accurately dealing with noisy data, which improved performance with an even more interpretable model, consisting of only 4 out of 28 variables (Ct and delta Ct values for all 14 lncRNAs).
In summary, our results demonstrate that 14 lncRNAs (HAGLR, ADAMTS9-AS2, LINC00261, MCM3AP-AS1, TP53TG1, C14orf132, LINC00968, LINC00312, TP73-AS1, LOC344887, LINC00673, SOX2-OT, AFAP1-AS1, and LOC730101) had the potential to separate cancerous from non-cancerous lung tissue as well as LUAD from LUSC in the early stages of NSCLC. We believe that the application of the panel of 14 lncRNAs can be a step towards a precise diagnosis for the early stages of NSCLC.

Conclusions
In the set of molecules we analyzed, 11 DElncRNAs successfully discriminated NSCLC tissue from corresponding adjacent non-cancerous tissue. The expression of 10 out of 14 lncRNAs was subtype-related (LUAD tumor or LUSC tumor). Using binary logistic regression classifiers and a tree-based model, we not only distinguished tumors from non-tumor tissues but also confirmed histological subtyping. Our results showed that the 14 lncRNA signatures can be a promising auxiliary tool to endorse and complement histopathological assessment. Nevertheless, a validation step in an independent group of patients is required. The expansion of future projects to advanced stages of NSCLC is also reasonable, given that the changes in lncRNA expression can be cancer stage-related.  Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The datasets analyzed during the current study are available from the corresponding author on reasonable request.