Study Of The Early Warning Model For Lung Cancer Based On Data Mining

Posted on: 2013-02-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Wang
GTID: 1114330371474891
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Lung cancer is one of the most common malignancies worldwide. Its morbidity and mortality continue to rise, and it poses a grave threat to human health. In China, about 400,000 people die of lung cancer every year, and its morbidity and mortality are the highest among malignant tumours. Studies show that postoperative 10-year survival can reach 92% in patients with stage I lung cancer. However, lung cancer is very difficult to diagnose at an early stage and is highly malignant, so patients are usually diagnosed at an advanced stage, after the best opportunity for surgery has passed, and the overall 5-year survival rate is only about 15%. Early detection, early diagnosis and early treatment are therefore vital to reducing lung cancer mortality. The development of lung cancer is a complex, multistep process involving many factors and many genes. Because traditional methods such as imaging and bronchoscopy are limited in sensitivity, specificity and applicability, many researchers in recent years have turned to exploring new molecular markers and to the combined detection of multiple tumour markers, in search of more reasonable and sensitive marker combinations.

Lung cancer arises from both environmental and genetic factors. We therefore searched for biomarkers for the early warning or diagnosis of lung cancer from two angles: biomarkers of susceptibility and biomarkers of effect. Genetic factors belong to the former; they are reflected in individual differences in tumour susceptibility, which are determined by genetic polymorphisms. On the other hand, many molecular events often occur before an obvious malignant phenotype appears, so detecting early molecular events during the development of lung cancer, in order to identify precancerous lesions or early-stage carcinogenesis, is also one of the most promising approaches.
Early biological effects during tumorigenesis include genetic and epigenetic changes such as DNA methylation and telomere damage.

Data mining, also known as knowledge discovery in databases, is a complex process of extracting previously unknown and valuable knowledge, such as models or regular patterns, from large volumes of data. It is closely tied to computer science and discovers knowledge through statistics, online analysis, information retrieval, machine learning, expert systems (which rely on past rules of thumb), pattern recognition and related methods. Data mining differs essentially from traditional data analysis: it excavates information and discovers knowledge without a clearly stated prior hypothesis, and the information it yields is previously unknown, valid and practical. Decision trees and artificial neural networks can process large-scale data in parallel and store it in a distributed manner, and they show good self-adaptation and self-organization as well as strong learning, association and fault-tolerance capabilities. In tumour diagnosis, data mining techniques can not only detect suspicious lesions and their types but also mine potential characteristic markers that contribute constructively to diagnosis.

In this study, genetic polymorphisms of the CYP1A1, GSTM1, GSTT1, mEH and XRCC1 genes, methylation of the p16 and RASSF1A genes, and telomere length were measured in the peripheral blood of lung cancer patients and healthy people to explore their correlation with lung cancer. Data mining techniques were then used to examine the relevance of these molecular indices to the early warning or diagnosis of lung cancer, to extract effective features and construct a suitable prediction model of lung cancer, and to explore whether such a model can increase the accuracy of early lung cancer diagnosis and what value combined detection has in the auxiliary diagnosis of lung cancer.
The ultimate aim is to provide automatic early warning, diagnosis and classification of lung cancer, together with valuable information for screening high-risk populations and for the clinical diagnosis of lung cancer.

Objectives

1. To study the association between genetic polymorphisms of metabolizing and DNA repair enzymes and susceptibility to lung cancer; to explore the association of p16 and RASSF1A methylation and telomere length with the occurrence of lung cancer; and to screen out effective molecular biomarkers correlated with lung cancer and identify the most significant indices, so as to provide a preliminary basis for the early warning or diagnosis of lung cancer.
2. To combine data mining techniques with the above indices to construct an intelligent diagnostic model that can analyse information automatically and so increase the accuracy of early diagnosis of lung cancer.

Materials and methods

1. 251 lung cancer patients and 256 healthy persons were chosen as study subjects.
2. CYP1A1-exon7 genotypes were detected by AS-PCR; GSTM1 and GSTT1 genotypes by multiplex PCR; and CYP1A1-Msp1, mEH-exon3, mEH-exon4, XRCC1-194, XRCC1-280 and XRCC1-399 genotypes by PCR-RFLP. Methylation levels of p16 and RASSF1A were measured by qMSP, and telomere length by RT-PCR.
3. Data were analysed in SPSS 12.0 using the chi-square test, t test, rank sum test and logistic regression to explore the association between the above indices and lung cancer and to screen out effective indices for the early discrimination model of lung cancer.
4. The samples of each group were divided into a training set and a testing set at a ratio of 3:1. Fisher discriminant analysis, the C5.0 decision tree and the BP (back-propagation) algorithm were used to build models on the training set, and the models were then applied blind to the testing set to verify their performance. In this way an intelligent model was developed for the early diagnosis of lung cancer.

Results

1. The frequencies of the GSTM1-null, CYP1A1-exon7 mt/mt, mEH-exon3 mt/mt, XRCC1-194 Trp/Trp and XRCC1-280 His/His genotypes were significantly higher in the case group than in the control group (P<0.05). The risk of lung cancer was increased for individuals carrying the genotypes GSTM1 null (ORadj=1.727, 95%CI: 1.211-2.463), CYP1A1-exon7 Ile/Val+Val/Val (ORadj=1.727, 95%CI: 1.203-2.477), mEH-exon3 wt/mt+mt/mt (ORadj=1.758, 95%CI: 1.194-2.589), XRCC1-194 Arg/Trp+Trp/Trp (ORadj=1.542, 95%CI: 1.083-2.196) and XRCC1-280 His/His (ORadj=2.941, 95%CI: 1.427-6.060), compared with subjects carrying the genotypes GSTM1 non-null, CYP1A1-exon7 Ile/Ile, mEH-exon3 wt/wt, XRCC1-194 Arg/Arg and XRCC1-280 Arg/Arg+Arg/His, respectively. The CYP1A1-Msp1, GSTT1, mEH-exon4 and XRCC1-399 genotypes did not differ significantly between the two groups (P>0.05). For the discrimination model built on these indices, the accuracy on the training and testing sets was 63.59% and 63.25% for the Fisher model, 95.64% and 82.61% for the decision tree, and 84.1% and 80.77% for the ANN; the AUCs of the three models were 0.627 (Fisher), 0.836 (decision tree) and 0.821 (ANN).
2. In the peripheral blood of the lung cancer group, the p16 methylation level, RASSF1A methylation level and telomere length were 0.59 (0.16-4.50), 27.62 (9.09-52.86) and 0.93±0.32, respectively, and differed significantly from those of the control group. Hypermethylation of the p16 and RASSF1A genes and telomere shortening were correlated with an increased risk of lung cancer. Sex, age, tobacco smoking, lung cancer stage and pathological type showed no significant association with p16 hypermethylation, RASSF1A hypermethylation or telomere length (P>0.05). For the discrimination model built on these indices, the accuracy on the training and testing sets was 66.34% and 65.82% for the Fisher model, 77.26% and 75.45% for the decision tree, and 72.15% and 71.72% for the ANN; the AUCs of the three models were 0.660 (Fisher), 0.782 (decision tree) and 0.759 (ANN).
3. The hypermethylation level of the p16 gene differed significantly among XRCC1-280 genotypes; the hypermethylation level of the RASSF1A gene differed significantly among genotypes of CYP1A1-exon7, GSTM1, mEH-exon3 and XRCC1-280; and telomere shortening differed among genotypes of CYP1A1-exon7 and GSTM1. For the discrimination model built on the above indices, the accuracy on the training and testing sets was 72.15% and 70.59% for the Fisher model, 93.88% and 93% for the decision tree, and 92.96% and 89.62% for the ANN; the AUCs of the three models were 0.722 (Fisher), 0.929 (decision tree) and 0.894 (ANN). For clinical early-stage (Ⅰ+Ⅱ) lung cancer, the accuracy of the decision tree and ANN models was 96.36% and 89.09%, respectively.

Conclusion

1. The genetic polymorphisms of CYP1A1-exon7, GSTM1, mEH-exon3, XRCC1-194 and XRCC1-280, hypermethylation of the p16 and RASSF1A genes, and telomere shortening might contribute to the risk of developing lung cancer, and together these indices formed a tumour marker panel for the early diagnosis model of lung cancer.
2. The discriminative model of lung cancer based on multidimensional molecular events related to the occurrence of lung cancer was superior to models based on a single type of molecular marker.
3. The diagnostic model of lung cancer based on multiple tumour markers and data mining techniques was superior to the traditional discriminant approach, was more suitable for the analysis of clinical data than conventional statistics, and could be used for the early warning of lung cancer.
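
The evaluation protocol described under Materials and methods (a 3:1 split within each group, then accuracy and AUC measured on the held-out testing set) can be illustrated with a minimal Python sketch. This is not the software used in the dissertation, which reports SPSS 12.0, C5.0 and a BP network. The function names, the random seed and the rule of rounding each group's testing share down are assumptions made here for illustration. Only the group sizes (251 cases, 256 controls) come from the abstract.

```python
import random


def stratified_split(labels, train_parts=3, test_parts=1, seed=0):
    """Split sample indices into training/testing sets within each class.

    The train_parts:test_parts ratio is applied per class, mirroring the
    3:1 split of "each group" in the abstract. Rounding the testing share
    down is an assumption, not a detail taken from the source.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = len(idx) * test_parts // (train_parts + test_parts)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen case (label 1) scores
    higher than a randomly chosen control (label 0); ties count as 0.5.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Group sizes from the abstract: 251 cases (1) and 256 controls (0).
labels = [1] * 251 + [0] * 256
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 381 126
```

Any of the three classifiers (Fisher discriminant, decision tree, neural network) would be fitted on `train_idx` only. `accuracy` and `auc` would then be computed once on `test_idx`, which is what keeps the testing-set figures an honest estimate of performance on unseen subjects.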
Keywords/Search Tags: Genetic polymorphism, DNA methylation, Telomere, Lung cancer, Early warning, Decision tree, ANN, Fisher discriminant analysis