| Lung cancer is a complex disease involving genetic and epigenetic changes and is a major cause of cancer deaths worldwide. In recent years, the incidence of lung cancer in China has increased significantly, making it a major public health challenge. Although the treatment of lung cancer has improved with rising health standards and advances in clinical drug application, the survival rate and overall prognosis of patients with advanced lung cancer remain relatively poor. Improving the efficiency of early diagnosis is therefore the key to improving the prognosis of patients with lung cancer. Epidemiological studies indicate a strong statistical correlation between smoking and lung cancer. It is estimated that there are 1.25 billion smokers worldwide and that more than one million people die each year from lung cancer caused by tobacco. Smoking is closely related to the development of lung cancer: 85% of lung cancer cases are associated with smoking, and the 2-year survival rate is less than 10%. The United States screening center for lung cancer has recommended lung cancer screening for people with a smoking history of more than 30 years and a smoke-free period of less than 15 years. Tumor markers are biochemical substances that reflect the presence of a tumor. They are absent from normal adult tissue or present only in embryonic tissue, and their levels are significantly higher in tumor tissue than in normal tissue. Qualitative or quantitative changes in these markers may indicate the nature of the tumor, which helps in understanding the tumor tissue, cell differentiation and cell function, and aids the diagnosis, classification, prognosis and treatment of tumors. However, the detection sensitivity and specificity of a single tumor marker are usually lower than those of a panel of multiple tumor markers with different, complementary characteristics and sensitivities. 
Therefore, joint detection of multiple tumor markers is now commonly used to improve the detection of early lung cancer. Imaging is one of the important methods for diagnosing lung cancer, but the low sensitivity of X-ray film is a main cause of delayed diagnosis. Recent lung cancer screening studies in the United States have shown that, in the high-risk population, lung cancer mortality with low-dose CT scanning was 20% lower than with X-ray. As a result, this examination has been recommended by the U.S. preventive services center, the American Cancer Society and other advisory bodies. Low-dose CT is now used to examine patients with highly suspected lung cancer, and its high sensitivity helps to find and identify early lung cancer. However, the specificity of CT for lung cancer diagnosis is poor. Tumor markers show lower sensitivity than dynamic CT but higher specificity than CT scans. Therefore, combining CT scanning with tumor markers can help to distinguish lung cancer from benign pulmonary disease. Data mining techniques, as modeling tools, have proved able to absorb information from multiple sources, analyze it precisely and build complex models. Many studies have now combined tumor characteristics with data mining technology to diagnose tumors. Although many factors are involved in lung cancer diagnosis and the relationships among them are complex, data mining technology can learn fuzzy evaluations that cannot be described by an explicit mathematical approach, and can solve complex, uncertain and nonlinear problems. 
Especially when faced with large-sample, multimedia and multivariable data, data mining technology shows better ability in dealing with nonlinear problems and unknown data distributions. Objective: Building on the results of earlier research, the present study combined serum tumor markers, epidemiology, clinical symptoms and imaging characteristics, and used data mining technology to establish an auxiliary diagnosis model for distinguishing lung cancer from benign lung disease, in order to further improve the accuracy of lung cancer diagnosis, provide a reference and an auxiliary method for the diagnosis of lung cancer, and improve the survival rate and prognosis of patients with lung cancer. Methods: 1. Serum specimens were collected from 423 patients hospitalized in the Department of Respiratory Medicine of the First Affiliated Hospital of Zhengzhou University from October 2014 to March 2016, and their serum tumor marker levels were measured. Epidemiological and clinical information was extracted from the medical records revised and completed by resident or attending physicians, including gender, age, smoking history, drinking history, family history (tumor), cough, sputum, blood in the sputum, fatigue, fever, sweating and hoarseness. 2. Fisher discriminant analysis and Logistic regression analysis were used to select and optimize the serum tumor marker, epidemiological and clinical symptom indexes. 3. The selected and optimized indexes were used to establish lung cancer diagnosis models using data mining techniques (ANN, SVM and decision tree C5.0) and Fisher discriminant analysis. 4. CT imaging data of the 423 patients were collected at the same time, and the CT images of 214 cases were selected as research objects according to the case inclusion and exclusion criteria. 5. Three highly qualified attending physicians were invited to read the CT images of the 214 patients, extract 19 features and score each of them independently. 
The final score for each imaging feature was the average of the three physicians' scores. 6. The 19 extracted imaging indexes were selected and optimized by Fisher discriminant analysis and Logistic regression analysis, and the resulting indexes were used to establish lung cancer diagnosis models using data mining techniques (ANN, SVM and decision tree C5.0) and Fisher discriminant analysis. 7. Fisher discriminant analysis and Logistic regression analysis were used to select and optimize the combined set of serum tumor marker, epidemiological, clinical symptom and CT imaging indexes, and the resulting indexes were used to establish lung cancer diagnosis models using data mining techniques (ANN, SVM and decision tree C5.0) and Fisher discriminant analysis. Results: 1. The sensitivity, specificity, accuracy, positive predictive value, negative predictive value and AUC of models that incorporated epidemiological and clinical indicators were significantly higher than those of models based on tumor marker detection alone. 2. Among the models established from tumor marker, epidemiological and clinical symptom indexes, the sensitivity, specificity, accuracy, positive predictive value, negative predictive value and AUC of the ANN model were higher than those of the other three models, and the differences in the area under the ROC curve were statistically significant (P < 0.05). 3. There was no statistically significant difference in AUC among the ANN models established from the combined tumor marker, epidemiological and clinical symptom indexes. 
However, the ANN model built on the 13 indicators retained after Logistic stepwise regression analysis of the 10 tumor markers and the epidemiological and clinical symptom indexes (namely age, sex, smoking history, sputum, blood in the sputum, fever, sweating, DNMT3B, DNMT1, HDAC1, gastrin, NSE, CEA and calcium ion) reached a training set accuracy of 100%, a prediction accuracy of 94.33%, a specificity of 95.5% and a positive predictive value of 93.8%, all of which were higher than those of the other models. 4. Among the imaging models, the SVM model based on the three variables selected by Logistic stepwise regression analysis (cavity, spine sign and tracheal stenosis) had a higher accuracy, negative predictive value and AUC than the other models, at 86.9%, 91.8% and 0.857, respectively. Its sensitivity, specificity and positive predictive value were also higher than those of the other models, at 92.3%, 81.8% and 90.6%, respectively. 5. For the SVM model based on the 16 indexes selected by Logistic regression analysis from the serum tumor marker, epidemiological, clinical symptom and imaging indexes, the specificity, accuracy, positive predictive value and AUC of the lung cancer predictions were 95.5%, 97.2%, 95.4% and 0.969, respectively, and the sensitivity and negative predictive value were 99.0% and 95.4%. 6. The lung cancer diagnostic performance of the SVM and decision tree C5.0 models that combined serum tumor marker, epidemiological, clinical symptom and imaging indexes was superior to that of the SVM and decision tree C5.0 models based on imaging alone, and the difference in AUC was statistically significant (P < 0.05). Conclusion: 1. 
Fisher discriminant analysis and Logistic regression analysis were used to select and optimize the serum tumor marker, epidemiological and clinical symptom indexes, and an ANN model for lung cancer diagnosis was established from the selected indexes. The sensitivity, specificity, accuracy, positive predictive value, negative predictive value and AUC of this model were significantly higher than those of the data mining models based on serum tumor markers alone, so it could better support the clinical auxiliary diagnosis of lung cancer. 2. The SVM model based on the three variables selected by Logistic stepwise regression analysis (cavity, spine sign and tracheal stenosis) could be used as a method for the clinical imaging diagnosis of lung cancer. 3. The lung cancer diagnostic performance of the SVM and decision tree C5.0 models that combined serum tumor marker, epidemiological, clinical symptom and imaging indexes was superior to that of the SVM and decision tree C5.0 models based on imaging alone, so they could be used as an optimized method for the clinical auxiliary diagnosis of lung cancer. |
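
The evaluation measures reported in the Results (sensitivity, specificity, accuracy, positive and negative predictive value, and AUC) can be illustrated with a minimal sketch. It assumes binary labels (1 = lung cancer, 0 = benign lung disease) and a model that outputs a continuous score. The function names `diagnostic_metrics` and `auc`, the 0.5 cut-off and the toy data are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def diagnostic_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix measures; 1 = lung cancer, 0 = benign lung disease.

    Assumes both classes occur in y_true and in y_pred, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cancers correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign cases correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign cases flagged as cancer
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cancers that were missed
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed cut-off of 0.5

metrics = diagnostic_metrics(y_true, y_pred)
model_auc = auc(y_true, scores)
```

Unlike the other five measures, AUC is computed from the continuous scores and does not depend on the chosen cut-off, which is presumably why the abstract uses it to compare models.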