Multi-source Data Integration Based On Logistic Regression For Identifying Disease Genes

Posted on: 2019-09-28    Degree: Master    Type: Thesis
Country: China    Candidate: L Li
GTID: 2404330572452112    Subject: Computer application technology
Abstract/Summary:
Cancer, also known as malignant neoplasm, poses a serious threat to human health. Since the completion of the Human Genome Project, detecting disease-associated genes has become the basis for understanding pathogenesis and for the prevention and treatment of cancer. The prediction of disease genes is a central issue in biomedical research. Many studies show that genes associated with the same or similar diseases tend to lie close to each other in many biological networks, so effectively integrating multiple data sources can improve the accuracy of human disease gene identification. Identifying cancer-related genes is a typical imbalanced classification problem: the number of known cancer-related genes is far smaller than the number of unlabeled genes, which makes detection with a regular machine learning method very hard. This work studies the detection of human disease genes, and the results are as follows:

1. The work proposes a multi-source data integration algorithm for the detection of human disease genes. Human protein complex data are added to the prior label estimation, and multiple data sources are integrated through feature vector reconstruction in a binary logistic regression algorithm. The algorithm also integrates gene-gene connection information. The experimental results show that the proposed algorithm improves identification accuracy.

2. To circumvent the imbalanced classification issue, an algorithm based on multi-step regression and random re-sampling, built on the above method, is used to identify genes related to a target cancer. The process is divided into two main stages. The first stage aims to find the genes related to the cancer class as a whole. The genes related to all cancers are merged into a single positive instance set, which alleviates the imbalance to a certain extent. Genes that perform poorly in each round of logistic regression are then deleted to balance the data. The second stage aims to find the genes related to the individual cancers. The algorithm based on multi-step regression and random re-sampling overcomes the imbalance when identifying genes related to individual cancers and improves prediction accuracy.

3. The proposed method is compared with other state-of-the-art methods. A biological pathway enrichment analysis is performed for five cancer-related disease genes, and the biological significance of each corresponding pathway is explained.

In conclusion, the method proposed in this research performs well on the identification of disease-related genes and provides a reference for the prediction, diagnosis and treatment of cancer.
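The two-stage idea in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the abstract gives no formulas, so the toy feature columns, the gene names, the `keep_fraction` pruning rule, the number of re-sampling rounds and the plain gradient-descent fit are all assumptions chosen for illustration. Stage one merges the genes of all cancers into one positive set and prunes low-scoring genes; stage two re-scores the surviving genes against the target cancer only. In each round, unlabeled genes are randomly under-sampled to match the number of positives.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    # w holds one weight per feature, with the bias stored last.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def fit_logistic(X, y, lr=0.5, epochs=300):
    # Binary logistic regression fitted by batch gradient descent.
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def resampled_scores(features, positives, rounds=10, seed=0):
    # Average the scores of several models; each round draws a random
    # subset of unlabeled genes, equal in size to the positive set.
    rng = random.Random(seed)
    genes = sorted(features)
    pos = sorted(positives)
    unlabeled = [g for g in genes if g not in positives]
    totals = {g: 0.0 for g in genes}
    for _ in range(rounds):
        neg = rng.sample(unlabeled, min(len(pos), len(unlabeled)))
        X = [features[g] for g in pos + neg]
        y = [1] * len(pos) + [0] * len(neg)
        w = fit_logistic(X, y)
        for g in genes:
            totals[g] += predict(w, features[g])
    return {g: totals[g] / rounds for g in genes}


def two_stage(features, cancer_genes, target, keep_fraction=0.5):
    # Stage 1: genes of all cancers form one merged positive set.
    merged = set().union(*cancer_genes.values())
    stage1 = resampled_scores(features, merged)
    ranked = sorted(features, key=lambda g: stage1[g], reverse=True)
    # Assumed pruning rule: keep the top fraction plus the target's known genes.
    kept = set(ranked[: max(1, int(len(ranked) * keep_fraction))])
    kept |= cancer_genes[target]
    # Stage 2: re-score the surviving genes against the target cancer only.
    sub = {g: features[g] for g in kept}
    return resampled_scores(sub, cancer_genes[target])


# Hypothetical toy data: each gene has two made-up integrated features.
FEATURES = {
    "a1": [0.90, 0.90], "a2": [0.80, 0.85], "a3": [0.85, 0.80],
    "b1": [0.90, 0.10], "b2": [0.80, 0.20], "b3": [0.85, 0.15],
    "candidate": [0.88, 0.82],
}
for i in range(8):  # unlabeled background genes with low feature values
    FEATURES["u%d" % i] = [0.10 + 0.01 * i, 0.20 - 0.01 * i]

CANCER_GENES = {"A": {"a1", "a2", "a3"}, "B": {"b1", "b2", "b3"}}

scores = two_stage(FEATURES, CANCER_GENES, "A")
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 3))
```

In this toy setting, the unlabeled gene `candidate` resembles the known genes of cancer A and so survives stage one and ranks above the genes of cancer B in stage two. A real pipeline would replace the hand-written features with vectors built from the integrated data sources.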
Keywords/Search Tags: disease-related genes, multi-step logistic regression, multiple data integration, random re-sampling