Prediction of bone metastasis in non-small cell lung cancer based on machine learning

Objective: The purpose of this paper was to develop a machine learning algorithm with good performance in predicting bone metastasis (BM) in non-small cell lung cancer (NSCLC) and to establish a simple web predictor based on the algorithm. Methods: Patients diagnosed with NSCLC between 2010 and 2018 in the Surveillance, Epidemiology and End Results (SEER) database were included. To increase the extensibility of the research, data from patients first diagnosed with NSCLC at the First Affiliated Hospital of Nanchang University between January 2007 and December 2016 were also included. Independent risk factors for BM in NSCLC were screened by univariate and multivariate logistic regression. On this basis, we chose six commonly used machine learning algorithms to build predictive models: Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Gradient Boosting Machine (GBM), Naive Bayes classifiers (NBC) and eXtreme Gradient Boosting (XGB). The area under the receiver operating characteristic curve (AUC), accuracy, sensitivity and specificity were used to evaluate the performance of these models, and the best model was then used to build the web predictor for BM in NSCLC patients. Results: A total of 50581 NSCLC patients were included in this study, and 5087 (10.06%) of them developed BM. Sex, grade, laterality, histology, T stage, N stage, and chemotherapy were independent risk factors for BM in NSCLC. Of the six models, the model built with the XGB algorithm performed best in both internal and external validation, with AUC scores of 0.808 and 0.841, respectively. The XGB algorithm was therefore used to build a web predictor of BM in NSCLC. Conclusion: This study developed an XGB-based web predictor for the risk of BM in NSCLC patients, which may assist doctors in clinical decision making.


Introduction
Lung cancer is one of the most common malignant tumors worldwide, with an annual incidence of 2 million cases and 1.76 million deaths each year (1). Non-small cell lung cancer (NSCLC) accounts for approximately 85% of lung cancer cases, and its overall survival rate is improving thanks to better therapy (2). However, bone metastasis (BM) is a negative prognostic factor in NSCLC patients. Studies have reported that the incidence of BM in patients with NSCLC is 26-36%, and the 2-year survival rate of patients with BM is 3% (3). In addition, bone metastases often cause complications such as pain, hypercalcemia, spinal cord compression, pathological fractures and neurological deficits, which reduce the patient's quality of life (4).
Early diagnosis and intervention could significantly improve the prognosis of patients with BM. At present, the bone scan is the classic method for detecting bone-related diseases (5), but because of its low sensitivity to bone metastases, early BM is often missed. Although studies have shown that PET-CT can improve the detection rate of small bone lesions (6), its use as a screening tool is limited by its high cost and high radiation exposure. Therefore, bone scans and PET-CT are recommended only when a bone-related event is suspected, which usually occurs 5 months after BM (7). By then, many NSCLC patients may have developed multiple metastases, and their prognosis is poor.
Previous studies (8-10) reported that histopathological type, gender, histological differentiation, serum CA-125, ALP, and multiple lymph node metastasis are independent risk factors for BM of lung cancer, which lays the foundation for constructing a prediction model. Zhang Chao et al. (10) constructed a nomogram to predict BM in different histological types of lung cancer based on the traditional logistic model. However, the limitations of this method in prediction accuracy and in processing big data have made it difficult to achieve major breakthroughs in precision medicine (11,12). Therefore, advanced machine learning models were used in this study.
Compared with the traditional logistic model, machine learning (ML) technology can extract more information from large datasets for outcome prediction and achieves higher accuracy (12). This technology already has many applications throughout science and society, ranging from driverless cars to board games to decision-making (12,13). In biomedicine, the development of big data in healthcare (14,15) offers great potential for ML to improve the understanding of disease and health, and ML has been used in clinical diagnostics, precision therapeutics, and health monitoring (16).
Therefore, in this study, we aimed to find a machine learning algorithm with good prediction performance and to deploy it as a web-based calculator that can be easily used to predict the risk of BM in NSCLC patients.

Study population
In this study, we used SEER*Stat 8.4.0 software to download patient data from the SEER-Medicare database (November 2020 submission). Patients diagnosed with lung cancer between 2010 and 2018 were included. The exclusion criteria were as follows: (1) the histological type of lung cancer was small cell cancer or unknown; (2) information on T stage, N stage, race, grade, marital status, or bone metastatic status was missing or unknown; (3) lung cancer was not the first tumor. The study flow chart of case screening is presented in Figure 1. Additionally, 507 patients newly diagnosed with NSCLC at the First Affiliated Hospital of Nanchang University between January 2007 and December 2016 were included in this study.

Data selection
In this study, 11 variables related to the clinicopathology and demographics of patients were selected for analysis. Demographic variables included race, sex, marital status and age. Clinicopathological variables included primary site, grade, histology, T stage, N stage, laterality and chemotherapy. According to the ICD-O-3 codes, histological types of NSCLC were divided into 5 categories: adenocarcinoma (814-838), squamous cell carcinoma (805-808), adenosquamous carcinoma (856), large cell carcinoma (8012-8014) and others. All NSCLC patients were staged according to the AJCC 7th edition guidelines and SEER staging information. In addition, following the study of Zhou et al. (8), we divided patients into two groups at 60 years of age to analyze the effect of age on outcome events.

Data pre-processing and feature engineering
All statistical analyses were conducted with Python 3.8, SPSS 23 and R 4.2.0. We performed logistic regression analysis on the SEER data in SPSS 23 to identify suitable variables for the machine learning models. Variables that differed significantly between BM and non-BM patients were identified by univariate logistic regression analysis (P < 0.05). These variables were then entered into multivariate logistic regression analysis, and variables with P < 0.05 in the multivariate analysis were carried forward to the ML models. Correlation analysis was used to assess the correlation among the selected features. Because the data set is imbalanced, an over-sampling method was adopted for data processing (17). This method increases the number of samples in the minority class so that the classes are better balanced, with the aim of improving the accuracy of the model.
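The over-sampling step can be illustrated with a minimal sketch. The text does not state which over-sampling variant was used, so the example below assumes plain random over-sampling with replacement; the function name `random_oversample` and the toy data are hypothetical and are not taken from the study.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the existing class members.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (not SEER data): 9 patients without BM (label 0) and 1 with BM (label 1).
rows = [[i] for i in range(10)]
labels = [0] * 9 + [1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
class_counts = Counter(balanced_labels)
```

In a cross-validation setting, over-sampling is usually applied to the training folds only, so that duplicated minority samples do not leak into the held-out fold.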
For the 507 external samples, cancer stage and grade were unified according to the AJCC 7th edition criteria so that the parameters of the two data sets could be matched. Missing values were imputed with the R mice package using the classification and regression tree method (18). To compare the importance of the features, we extracted the importance of each variable in each machine learning model according to the Permutation Importance principle (19).
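The Permutation Importance principle can be sketched as follows: shuffle one feature column at a time and record how much the model's score drops. This is an illustrative sketch only. The scoring metric (accuracy), the number of repeats, and the stand-in model `toy_predict` are assumptions made for the example and do not reproduce the study's models.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the predicted label matches the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced.
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical stand-in model: uses feature 0 only and ignores feature 1."""
    return 1 if row[0] >= 2 else 0


# Toy data (not SEER data): two integer-coded features per patient.
rows = [[0, 1], [1, 0], [2, 1], [3, 0], [0, 0], [1, 1], [2, 0], [3, 1]]
labels = [toy_predict(row) for row in rows]
importances = permutation_importance(toy_predict, rows, labels)
```

A feature the model never uses receives an importance of exactly zero, while a feature the model relies on shows a positive drop in accuracy.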

Model establishment and evaluation
During model establishment, the SEER data set was used as internal data to build and validate the models, while the hospital data were used as external validation data to evaluate the predictive ability of the machine learning models. The risk stratification threshold of the model was set to 0.5 (50%) (20).
Six commonly used classifier algorithms were chosen for this study, including three ensemble algorithms (12) (Random Forest (RF), Gradient Boosting Machine (GBM), eXtreme Gradient Boosting (XGB)) and three simple classification algorithms (Logistic Regression (LR), Decision Tree (DT), Naive Bayes classifiers (NBC)). The ML models were trained in Python. For internal testing, the SEER data were divided into 10 parts for 10-fold cross-validation (21). For external testing, the external data were fed directly into the trained model for verification. The area under the receiver operating characteristic curve (AUC), sensitivity, specificity, accuracy and F-score were used as evaluation metrics for the ML algorithms. After comparing the evaluation metrics of each model, the best model was selected to build the web predictor.
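The evaluation metrics named above can be made concrete with a small sketch. It computes AUC as the probability that a randomly chosen BM patient receives a higher score than a randomly chosen non-BM patient, and derives the threshold-based metrics from the confusion matrix at a cut-off of 0.5. The function names and toy scores are hypothetical, and the F-score is assumed here to be the F1 score, which the text does not specify.

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_metrics(y_true, scores, threshold=0.5):
    """Confusion-matrix metrics; a score above the threshold is predicted as BM."""
    pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / len(y_true),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Toy labels and predicted probabilities (illustrative only, not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
auc = roc_auc(y_true, scores)
metrics = threshold_metrics(y_true, scores)
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; for a data set the size of SEER, a rank-based implementation from a standard library would be the practical choice.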

Demographic and pathological characteristics
In this study, 50581 patients diagnosed with NSCLC in the SEER database were included. Of these, 5087 (10.06%) developed BM and 45494 (89.94%) had no BM. In addition, 507 NSCLC patients from the First Affiliated Hospital of Nanchang University were selected for external validation of the models. The characteristics of all data are shown in Table 1.

Logistic regression analysis
The univariate analysis based on the training set is presented in Table 2. The results showed that marital status was not significantly associated with BM in NSCLC patients (P > 0.05). The remaining ten significant variables were selected for multivariate logistic regression analysis. Multivariate logistic regression analysis (Table 2) indicated that seven variables (sex, grade, laterality, histology, T stage, N stage, and chemotherapy) were independent risk factors for BM of NSCLC. These seven variables were used in the subsequent machine learning models.
Figure 1. The study flow chart of case screening.
Li et al. 10.3389/fonc.2022.1054300 Frontiers in Oncology frontiersin.org

Correlation analysis of features
Correlation analysis is often used to measure the strength of association between features. To assess the independence of the features, we generated a correlation heat map using Spearman correlation analysis. The heat map showed no strong correlation among these 11 features (Figure 2).
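As a brief illustration of the statistic behind the heat map, the Spearman coefficient is the Pearson correlation of the rank-transformed values, with tied values sharing their average rank. The sketch below is a stdlib-only illustration with hypothetical function names and toy inputs. It is not the code used in the study.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotone but nonlinear relationship still yields a coefficient of 1.
rho_monotone = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
# A strictly decreasing relationship yields -1.
rho_reversed = spearman([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
tied_ranks = average_ranks([10, 20, 20, 30])
```

Because the coefficient depends only on ranks, it suits ordinal variables such as T stage, N stage and grade, where the spacing between categories is not meaningful.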

Importance of features on prediction
The feature importance extracted from each machine learning algorithm is shown in Figure 3. The variables screened by univariate and multivariate logistic analysis all contributed substantially to prediction in the six models. N stage ranked first in feature importance in all prediction models, indicating that N stage has a strong influence on BM of NSCLC, followed by T stage. In most algorithms, grade, histology, laterality, sex and chemotherapy ranked in the last five, with no significant difference in their contributions to the model. In the XGB model, the features in descending order of importance were N stage, T stage, chemotherapy, histology, grade, sex and laterality.

Model performance
The performance of the six predictive models is shown in Figures 4 and 5 and Table 3. Internal ten-fold cross-validation (Figure 4) showed that the XGB model performed best among the six models, with an average AUC of 0.808, followed by the GBM model (AUC = 0.804). The external validation results are shown in Table 3 and Figure 5. The XGB model also achieved the best AUC (0.841) in external validation, with accuracy, sensitivity (recall) and specificity of 0.744, 0.735 and 0.803, respectively. The confusion matrices (Figure 6) of the XGB model in the training set and the test set indicated its high accuracy.

Web predictor
In this study, a web predictor based on the XGB model, which had the best predictive performance for BM in NSCLC patients, was developed to help doctors make more accurate clinical decisions. The risk of BM in an NSCLC patient can be calculated simply by setting the BM-associated variables given in the web predictor (https://share.streamlit.io/liuwencai6/lung_bone/main/lung.py) (Figure 7). When the predicted value for a patient is greater than 0.5, the web predictor shows high risk.
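The decision rule described above can be written as a small function. This is a sketch of the thresholding step only. The function name `risk_label` and the "low risk" wording for values at or below the threshold are assumptions, since the text states only that values greater than 0.5 are shown as high risk.

```python
def risk_label(probability, threshold=0.5):
    """Map a predicted BM probability to a displayed risk category."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Strictly greater than the threshold, matching "greater than 0.5" in the text.
    return "high risk" if probability > threshold else "low risk"


label_above = risk_label(0.62)
label_at_threshold = risk_label(0.5)
```

A predicted value exactly equal to 0.5 is therefore not labelled high risk under this reading of the rule.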

Discussion
Lung cancer is one of the most common malignant tumors, with an annual incidence of 2 million (1). Traditionally, lung cancer is divided into two types, small cell lung cancer and NSCLC, with NSCLC accounting for 85% of cases (22). Compared with small cell carcinoma, NSCLC often has a better prognosis because of its slow growth and division (23). However, BM is considered an essential risk factor for the prognosis of NSCLC (24). Studies have reported that the incidence of BM in patients with NSCLC is 26-36% (3). Nevertheless, the National Comprehensive Cancer Network (NCCN) lung cancer screening guidelines do not recommend routine bone imaging evaluation in asymptomatic patients (10). To identify lung cancer patients at high risk of BM, we constructed a clinical predictor based on an advanced ML algorithm (XGB). Figure 8 shows the construction process of the clinical predictor and the outcomes of NSCLC with and without the predictor.
Artificial intelligence (AI) is a field of research in which computers are used to mimic human intelligence, and it has been successfully applied in various fields, including driverless driving, face recognition and music creation (25-27). ML is a subfield of AI that focuses on developing algorithms that learn from data (11). The emergence of electronic medical records (EMR) (28) has created a huge amount of analyzable data in the medical field, which provides the potential for the development of ML in medicine. In biomedicine, many studies have used ML algorithms to guide clinical diagnosis and treatment, including in COVID-19 and cancer metastasis (19,29). A statistical review of ML in the medical field by Kaustubh Arun Bhavsar et al. suggested that ML techniques can help clinicians make better clinical decisions and improve patient care and overall health (30).
Figure 3. Feature importance of different models.
In this study, we compared the performance of six different algorithms and found that the XGB algorithm performed best. XGB is an efficient, flexible and scalable machine learning classifier that has been widely used in the medical field, for example in COVID-19, chronic kidney disease diagnosis and BM of prostate cancer (PCa) (19,31,32). Liu et al. (19) compared the diagnostic ability of six algorithm models, including XGB, to predict BM of PCa, and found that the XGB model performed best. The consistency of these results indicates that the XGB algorithm has great potential in medical applications. Compared with the traditional logistic regression model, the XGB model can process big data efficiently and has higher accuracy (33). Table 4 shows the strengths and weaknesses of the previous model and the proposed model (33).
In this study, logistic regression analysis helped us screen out seven independent risk factors for BM in NSCLC. Verification with the various ML algorithms showed that all of these features contributed substantially to the prediction of BM, which agrees closely with the logistic regression analysis.
In previous studies (34-36), T stage, N stage and grade were considered risk factors for BM of cancer. Jie et al. (37) found that patients were more likely to have metastases if they had a higher N stage at diagnosis. In this research, we also found that lymphatic metastasis promotes BM in NSCLC patients and that the risk of BM increases with increasing N stage. Fan et al. (35) found that T stage and tumor grade affected BM of renal cell carcinoma. In this research, we likewise found that T stage and tumor grade play an important role in predicting the development of BM in NSCLC patients, and that BM of NSCLC was related to higher tumor grade and more advanced T stage.
Interestingly, in this study BM in NSCLC was more likely to occur in patients who had been treated with chemotherapy drugs, which may be due to the misuse of chemotherapy drugs, resulting in toxic effects on normal cells and promoting the proliferation of tumor cells (38,39). Hui et al. (40) found that adenocarcinoma was related to a high risk of BM. In this study, we found that tumor histological type is an important characteristic affecting BM, and that adenocarcinoma, the most common type of NSCLC, is more prone to BM than other types. In addition, sex and the laterality of the primary tumor site can also affect BM of NSCLC patients. Many studies have shown that gender affects the development of tumors (41,42). In this study, we found that males are at a higher risk of bone metastases than females. Moreover, we found that left primary lung cancer is more likely to develop bone metastases than right primary lung cancer, which may be associated with the left lung being close to the heart, leading to more hematogenous metastasis of left lung tumors.
Figure 5. The ROC curves of different machine learning models in the external test set.
In this study, we constructed an XGB-based predictor with SEER data to predict BM in NSCLC. This research can help clinicians make better clinical decisions and promote the integration of medicine and machine learning. However, this study has some limitations. First, current machine learning is almost entirely statistical or black-box, which brings severe theoretical limitations to its performance (43). Second, we cannot comment on which chemotherapeutic agents affect BM in NSCLC because the SEER database does not record the treatment regimen and dosage of chemotherapy patients. Third, most of the variables in the SEER database are clinical, which limits the accuracy of model prediction to some extent. Fourth, important parameters closely related to lung cancer, such as dust exposure, smoking, passive smoking, tobacco chewing, and alcohol use, are missing from the SEER database and could therefore not be included in the predictor. In the future, as the database continues to improve, we will incorporate more parameters associated with BM of NSCLC into the web predictor to improve its adaptability.
Figure 7. A web predictor for predicting bone metastases in non-small cell lung carcinoma patients.
In conclusion, we found that the XGB algorithm performed best among six different algorithms and used it to build a web predictor for BM of NSCLC that is accurate, simple and convenient to operate. This web predictor can easily predict BM of NSCLC and assist clinicians in diagnosis and in making better clinical decisions for NSCLC patients.

Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.

Ethics statement
We received permission to access the research data file in the SEER program from the National Cancer Institute, US. Approval was waived by the local ethics committee, as SEER data is publicly available and de-identified. This study was approved by the Ethics Committee of the First Affiliated Hospital of Nanchang University, and cases from the First Affiliated Hospital of Nanchang University signed a written informed consent form. The patients/participants provided their written informed consent to participate in this study. Written informed consent was obtained from the individual(s), and minor(s)' legal guardian/next of kin, for the publication of any potentially identifiable images or data included in this article.
Figure 8. The construction process of the clinical predictor and the outcomes of the NSCLC with and without the predictor.

Author contributions
MPL and WCL conceived of and designed the study. MPL, WCL, BLS, and NSZ performed analysis and generated the figures and tables. MPL and WCL wrote the manuscript, and ZLL, SHH, ZHZ and JML critically reviewed the manuscript. All authors contributed to the article and approved the submitted version.

Funding
This work was supported by the "Double Thousand Plan" Talent Project of Jiangxi Province and the Central Government Guides Local Funds for Scientific and Technological Development, China (No. 20222ZDH04095).