
Identification Of Cancers With Unknown Primary Tissue-of-origin Based On DNA Methylation Levels

Posted on: 2019-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: T Luo
Full Text: PDF
GTID: 2404330566460858
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Cancer of unknown primary (CUP) refers to metastatic cancer whose site of origin remains unknown after a standard diagnostic work-up. Metastasis is the process by which tumor cells spread from their site of origin to other parts of the body through the circulatory system. Determining the primary site can guide appropriate treatment and thereby improve patient prognosis. Clinical examination, imaging, and pathological examination, which are commonly used to detect the tissue of origin, identify it in only about 50% to 80% of CUP patients and have limited power in the remainder, so more effective methods are needed. DNA methylation is an important epigenetic modification. Methylation levels are tissue-specific and can therefore help identify the tissue of origin of a tumor. Machine learning algorithms can find regular patterns in large amounts of methylation data and classify unknown samples based on those patterns, which may help tissue-of-origin detection.

Through feature engineering and model evaluation, we developed several classifiers that trace the tissue of origin of CUP with high accuracy. First, we collected methylation data for 31 tumor types from the TCGA database, filtered out outliers and features with too many missing values, and used principal component analysis, non-negative matrix factorization, and singular value decomposition to reduce the dimensionality of the data matrix. We then built and evaluated eight classifiers based on different machine learning models (LASSO, neural network, random forest, support vector machine, linear discriminant analysis, k-nearest neighbors, decision tree, and naive Bayes). LASSO and the neural network performed best in 5-fold cross-validation, with accuracies of 96.77% and 96.76%, respectively. We tested our model on an independent test set from GEO covering 10 tumor types, where LASSO reached an overall accuracy of 91.97%. Next, we compared classifiers trained on methylation levels, mRNA, microRNA, and long noncoding RNA as feature sets and found that the methylation-trained classifier had the highest accuracy. Finally, to rank features and improve training efficiency, we proposed the maximum F-statistic maximum distance (MFMD) method, which ranks each probe by a weighted average of its F statistic and its average Euclidean distance to the other probes. The LASSO classifier trained on the top 5000 probes achieved an accuracy of 95.05%.

In the second part of this dissertation, we built a web server, CUPtracer (http://cuptracer.i-sanger.com/), for tissue-of-origin prediction. CUPtracer is built on the web.py framework. It provides format conversion tools for commonly used methylation callers as well as an e-mail notification service. All model parameters are set to their optimal values. CUPtracer offers researchers without programming skills a convenient way to identify the tissue of origin of CUP.

In summary, we have developed high-accuracy classifiers for identifying the tissue of origin of CUP and built a web server based on them. I hope this dissertation can provide ideas and tools for related studies.
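The abstract describes MFMD only in one sentence, so the following is a minimal sketch of how such a ranking might be computed, not the thesis implementation. Several details are assumptions made for illustration: the function names, the one-way ANOVA form of the F statistic, the min-max scaling that puts both terms on a common scale, the default weight of 0.5, and the choice to favor probes that lie far from the others.

```python
import math

def f_statistic(values, labels):
    """One-way ANOVA F statistic of one probe across tissue classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    # Small constant (an assumption) avoids division by zero when a probe
    # has no within-class variance.
    return (ss_between / (k - 1)) / (ss_within / (n - k) + 1e-12)

def mean_distance(probes, j):
    """Average Euclidean distance from probe j to every other probe."""
    others = [p for i, p in enumerate(probes) if i != j]
    return sum(math.dist(probes[j], p) for p in others) / len(others)

def min_max(xs):
    """Scale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def mfmd_rank(probes, labels, weight=0.5):
    """Return probe indices sorted by descending MFMD-style score.

    probes: one list of methylation values per probe (one value per sample).
    labels: tissue class of each sample, in the same sample order.
    weight: hypothetical mixing weight between F statistic and distance.
    """
    f = min_max([f_statistic(p, labels) for p in probes])
    d = min_max([mean_distance(probes, j) for j in range(len(probes))])
    scores = [weight * fj + (1 - weight) * dj for fj, dj in zip(f, d)]
    return sorted(range(len(probes)), key=lambda j: scores[j], reverse=True)

# Toy data: probe 0 separates the two tissues; probes 1 and 2 do not.
probes = [
    [0.10, 0.12, 0.11, 0.90, 0.88, 0.91],
    [0.50, 0.52, 0.49, 0.51, 0.50, 0.48],
    [0.48, 0.51, 0.50, 0.52, 0.49, 0.50],
]
labels = ["A", "A", "A", "B", "B", "B"]
ranking = mfmd_rank(probes, labels)
```

On this toy input the tissue-specific probe is ranked first. Selecting the top 5000 probes, as in the abstract, would then correspond to slicing the returned ranking.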
Keywords/Search Tags: tissue of origin, cancer of unknown primary, methylation, machine learning, classifier, CUPtracer