Data Mining and Feature Selection of High-Dimensional Biomedical Data Based on the TCGA and PubMed Databases

Posted on: 2018-02-21
Degree: Master
Type: Thesis
Country: China
Candidate: G Y Shan
GTID: 2348330518465258
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
With the rapid development of life science technology, especially sequencing technology, research data of all kinds are expanding rapidly. Biomedical data are not only massive in volume but also high-dimensional: the number of features is very commonly much larger than the number of observations (the sample size). These data therefore bring new opportunities as well as new challenges for researchers. The traditional point-to-point research model no longer applies, and how to dig out the chains of relationships hidden in massive data has become the focus of research. Feature selection refers to selecting a subset of the original features that can represent the characteristics of the original data and be used in subsequent data mining. It is no exaggeration to say that feature selection in data mining is like panning for gold in massive sands, and almost no complete data mining project can avoid this step. This thesis therefore takes feature selection as its entry point and explores bioinformatics methods for high-dimensional biomedical data through two important biomedical problems. We propose different feature extraction strategies at multiple levels and study their characterization and prediction ability on actual biomedical problems. The feature selection methods and results presented here can provide an important reference for processing and analyzing high-dimensional biomedical data.
Feature selection is mainly used in machine learning and statistics, where it refers to selecting relevant variables from a large number of candidates for model construction. It has three main advantages: it simplifies the model so that it is easier to understand, shortens model training time, and reduces overfitting, which improves the model's generalization ability.
In practical research, most of the variables in a variable set are redundant with respect to the research problem, and removing them causes no loss of information. As the 14th-century philosopher William of Ockham put it in the principle known as Occam's razor: entities should not be multiplied beyond necessity. It can be said that feature selection, and the simpler model it yields, is the soul of massive data processing. Feature selection is therefore a crucial and indispensable step in processing massive high-dimensional biomedical data, and it is the starting point of this thesis.
At present, there are two principal approaches to feature selection: one uses the topology of the data itself to screen for statistical signals, and the other introduces external knowledge, such as domain-specific background knowledge. This thesis uses data from the TCGA (The Cancer Genome Atlas) database to experiment comprehensively with both approaches for predicting cancer prognosis.
First, using the topology of the data itself, we focus on screening and discovering gene and microRNA diagnostic markers for hepatocellular carcinoma. In a network, a node with a relatively high degree is called a "hub". We combine survival analysis with feature selection and study the topological characteristics of survival-related molecules. We find that survival-related genes are enriched among the hub nodes, suggesting that hub nodes are more likely to be potential molecular markers for the prognosis of hepatocellular carcinoma.
Second, using external knowledge, we focus on predicting cancer chemotherapy resistance. Chemotherapy often fails mainly because the tumor develops multiple drug resistance (MDR).
Drug resistance is a relatively complex process. It usually arises when the proteins encoded by resistance-related genes are overexpressed and act as energy-dependent efflux pumps that pump chemotherapeutic drugs out of the cell, reducing drug accumulation in the cell and making the body resistant to the drugs. In this study, eight mutations associated with drug resistance in cancer were identified by relative risk and P-value screening and used as the feature set of the model. With this feature set, three machine learning methods predicted drug resistance in samples from eight cancers and performed well. In Head and Neck Squamous Cell Carcinoma (HNSC) in particular, the area under the curve (AUC) reaches 0.980, indicating good discrimination between drug-resistant and non-resistant samples and providing an important reference for helping patients select an appropriate treatment.
In addition to drug intervention, more and more studies show that dietary intervention is also an important means of regulating human health. Therefore, besides studying the prognosis of cancer treatment, we also predicted carbohydrates that are beneficial to human health (also known as prebiotics) from massive text data downloaded from the PubMed database. We extracted all the literature on 15 known prebiotics from PubMed, used it to build a model for predicting such carbohydrates, and computed a list of potential prebiotics. This approach to mining prebiotics from PubMed can serve as a reference for other data mining researchers, and the predicted potential prebiotics can provide an important reference list for prebiotics research.
As the era of large-scale biomedical data arrives, data mining becomes increasingly important. Data mining helps us understand life at a systems level and is an important method for studying life science.
Feature selection is the soul of data mining. On this basis, in future studies we will consider integrating textual data and biological expression data for simulation and analysis, and make meaningful attempts to improve human health.
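The abstract mentions screening mutations by relative risk and P-value but does not say how either was computed. The following is only a minimal sketch of that kind of screen, under several assumptions that are not in the original: each mutation is summarized as a 2x2 table of counts, the P-value comes from a two-sided Fisher's exact test, and the cutoffs `rr_min` and `p_max`, the mutation names, and the counts are all hypothetical illustration values.

```python
from math import comb


def relative_risk(a, b, c, d):
    """Relative risk of resistance for mutated vs. non-mutated samples.

    a: mutated and resistant      b: mutated and sensitive
    c: non-mutated and resistant  d: non-mutated and sensitive
    """
    risk_mutated = a / (a + b)
    risk_other = c / (c + d)
    if risk_other == 0:
        return float("inf")
    return risk_mutated / risk_other


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for a 2x2 table (assumed test)."""
    n = a + b + c + d
    row1 = a + b   # number of mutated samples
    col1 = a + c   # number of resistant samples
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x mutated-and-resistant samples.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def screen_mutations(tables, rr_min=2.0, p_max=0.05):
    """Keep mutations whose relative risk and P-value pass both cutoffs.

    The cutoffs are hypothetical defaults, not values from the thesis.
    """
    selected = []
    for name, (a, b, c, d) in tables.items():
        rr = relative_risk(a, b, c, d)
        p = fisher_exact_two_sided(a, b, c, d)
        if rr >= rr_min and p <= p_max:
            selected.append(name)
    return selected


# Hypothetical counts for two made-up mutations, for illustration only.
example_tables = {
    "MUT_A": (8, 2, 2, 8),
    "MUT_B": (5, 5, 5, 5),
}

if __name__ == "__main__":
    print(screen_mutations(example_tables))
```

Mutations that pass such a screen would then serve as the feature set for a downstream classifier. A real analysis would also need to handle multiple-testing correction and sparse tables, which this sketch leaves out.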
Keywords/Search Tags: Feature selection, Data mining, Text mining, Drug-resistance prediction