Application Of Machine Learning In Smoking Related Pattern Recognition For Lung Adenocarcinomas

Posted on: 2017-10-28
Degree: Master
Type: Thesis
Country: China
Candidate: S X Wang
Full Text: PDF
GTID: 2334330512980476
Subject: Chemical engineering
Abstract/Summary:
In the era of massive data, how to apply and analyze that data has become a hot topic, and the development and improvement of machine learning algorithms has made working with such data easier. Machine learning algorithms are now well established in many areas, including chemical process control, meteorological data analysis, spam identification and filtering, and genome-wide association analysis. With so much data available, especially the massive biomedical data produced by high-throughput sequencing, a major problem is how to exclude noisy information and identify the key information, so that the results of machine learning can support mechanism analysis and targeted therapy.

Although smoking is the main known pathogenic factor of lung cancer, statistics show that the proportion of never-smokers among lung adenocarcinoma patients is rising. Research on the pathogenesis of lung adenocarcinoma and on the molecular differences between current-smoker and never-smoker patients has therefore become a worldwide hot topic.

In this paper, we used genome-wide gene expression (GE) and methylation (ME) data from smoking-related lung adenocarcinoma patients to identify signature genes. The TCGA dataset served as the training samples and the EDRN dataset as independent validation samples. We proposed a novel iterative multi-step selection method to identify GE and ME signature genes separately from the genes of the whole genome. The method evaluates the importance of each gene with respect to the patients' tobacco exposure pattern according to three criteria: statistically significant differences, biological relationships, and contribution to the current-smoker/never-smoker classification model. The partial least squares (PLS) method was then used to classify current-smokers and never-smokers, with multiple rounds of iterative optimization to identify the signature genes. We aimed to reveal the relationship between smoking and the occurrence of lung adenocarcinoma, and to describe the differences between current-smoker and never-smoker lung adenocarcinoma patients at the genomic and molecular-biological level.

As a result, 43 genes were identified as GE signature genes and 48 as ME signature genes, and both sets achieved high classification accuracy: 79.2% and 87.5% on the training samples, and 86.3% and 76.4% on the independent validation samples. Gene pathway analysis further indicated that most of the signature genes are closely related to cancer development, biological functions, cell development, and similar processes. Most importantly, some of them have been verified by other published experimental research. Compared with other reported results, the method in this paper performed better in pattern recognition. The application of CNV data was also studied, and initial results were obtained.
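To make the select-then-classify idea in the abstract concrete, here is a minimal Python sketch. It is not the thesis's implementation. It assumes a two-stage procedure: a Welch t-statistic prefilter, followed by repeated PLS fits that drop the genes with the smallest coefficient magnitudes. The biological-relationship criterion mentioned in the abstract is omitted. All function names, thresholds (`n_prefilter`, `n_final`, `drop_frac`), and the synthetic data are hypothetical and chosen only for illustration. No TCGA or EDRN data are used.

```python
import numpy as np

def welch_t(X, y):
    """Per-gene Welch t-statistic between class 1 and class 0 samples."""
    a, b = X[y == 1], X[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)

def pls1_fit(X, y, n_components=2):
    """Fit single-response PLS (NIPALS); returns (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                      # weight: covariance of genes with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                         # sample scores
        tt = t @ t
        p = Xk.T @ t / tt                  # X loadings
        q = yk @ t / tt                    # y loading
        Xk = Xk - np.outer(t, p)           # deflate X and y
        yk = yk - q * t
        W.append(w); P.append(p); Q.append(q)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    coef = W @ np.linalg.solve(P.T @ W, Q)  # regression vector in gene space
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    """Label a sample 1 when the PLS score exceeds 0.5, else 0."""
    x_mean, y_mean, coef = model
    return ((X - x_mean) @ coef + y_mean > 0.5).astype(int)

def select_signature(X, y, n_prefilter=50, n_final=10, drop_frac=0.2):
    """Hypothetical two-stage selection; assumes genes are on comparable scales."""
    # Stage 1: keep the genes with the largest absolute t-statistics.
    keep = np.argsort(np.abs(welch_t(X, y)))[::-1][:n_prefilter]
    # Stage 2: refit PLS and drop the weakest contributors until n_final remain.
    while len(keep) > n_final:
        _, _, coef = pls1_fit(X[:, keep], y)
        n_drop = min(max(1, int(drop_frac * len(keep))), len(keep) - n_final)
        keep = keep[np.argsort(np.abs(coef))[n_drop:]]
    return np.sort(keep)

# Synthetic stand-in for expression data: the first 10 "genes" differ by class.
rng = np.random.default_rng(0)
n_samples, n_genes, n_informative = 160, 500, 10
y = rng.integers(0, 2, n_samples)
X = rng.normal(size=(n_samples, n_genes))
X[:, :n_informative] += 2.0 * y[:, None]

train, valid = np.arange(100), np.arange(100, 160)   # training / held-out split
signature = select_signature(X[train], y[train])
model = pls1_fit(X[train][:, signature], y[train])
accuracy = np.mean(pls1_predict(model, X[valid][:, signature]) == y[valid])
print("selected genes:", signature)
print("held-out accuracy:", accuracy)
```

The held-out split mirrors the abstract's use of a separate validation cohort: genes are selected and the model is fitted on the training rows only, so the reported accuracy is not inflated by the selection step.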
Keywords/Search Tags: machine learning, pattern recognition, lung adenocarcinoma, smoke exposure, genome-wide