Research on Differentially Expressed Genes in Disease Based on Granular Computing

Posted on: 2019-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: M M Sun
Full Text: PDF
GTID: 2394330548482855
Subject: Applied Mathematics
Abstract/Summary:
Based on the theory of granular computing, this thesis combines existing clustering algorithms with data mining algorithms such as Logistic Regression and Random Forest, improving and optimizing them to establish an optimization model. Virus protein sequences and gene chip data were downloaded from the NCBI and GEO databases, respectively, and the downloaded data were characterized and preprocessed. The model was applied to the processed data, and the results were analyzed to provide new and effective information for bioinformatics research. The main work of the thesis is summarized as follows.

Chapter 2 covers the preliminaries: the principles of various clustering algorithms, some basic concepts of granular space, the minimum spanning tree algorithm, the Logistic Regression model, and the Random Forest model.

Chapter 3 proposes, following granular computing theory, a minimum spanning tree classification algorithm based on a normalized metric. First, building on existing representation and generation algorithms for granular space, an optimization model is established by introducing the minimum spanning tree and a new optimal-clustering index based on intra-class deviation and inter-class deviation. Next, 898 avian influenza viruses containing both HA and NA proteins were used as the experimental dataset. Based on the characteristics of this dataset, the first run of the algorithm divided the 898 viruses into two classes. To study the viruses further, the algorithm was then run again on each of the two classes separately. Based on the nearest principle, 6 representative viruses were selected and a phylogenetic tree was constructed. Finally, a comparison with the literature showed that the variation of human influenza virus is closely related to region and outbreak time. These results are consistent with previous studies, indicating that the algorithm is effective. The minimum spanning tree classification algorithm also has lower complexity than the original algorithm in finding the optimal clustering.

Chapter 4 focuses on cancer. To screen differentially expressed genes quickly and efficiently on two breast cancer gene microarray datasets, this thesis combines the Logistic Regression and Random Forest algorithms into a novel method named LR-RF, which selects differentially expressed genes of breast cancer from microarray data using the Bonferroni test to control the family-wise error rate (FWER). Compared with Logistic Regression and Random Forest used on their own, the study shows that LR-RF performs well in selecting differentially expressed genes. Over ten repetitions of a random test, the average prediction accuracy of LR-RF reaches 93.11%, with a variance as low as 0.00045. The prediction accuracy reaches a maximum of 95.57% when the threshold value is set to 0.2 in the Random Forest step that ranks the genes' importance scores, and the number of differentially expressed genes selected at this setting is relatively small. In addition, analysis of the gene interaction networks shows that most of the top 20 selected genes are involved in the development of breast cancer. All of these results demonstrate the reliability and efficiency of LR-RF.
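
As a rough illustration of the general idea behind minimum spanning tree clustering mentioned in Chapter 3, the sketch below builds a spanning tree over a set of points and removes its longest edges to obtain clusters. It is not the thesis's algorithm: the Euclidean distance, the fixed cluster count `k`, and the function names `mst_edges` and `mst_clusters` are assumptions made for this example, and the normalized metric and the intra-/inter-class deviation index described in the abstract are not reproduced here.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def _find(parent, i):
    """Union-find lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mst_edges(points):
    """Return the MST of the complete Euclidean graph as (weight, i, j) tuples.

    Uses Kruskal's algorithm: visit edges in ascending weight order and keep
    an edge only if it joins two different components.
    """
    n = len(points)
    parent = list(range(n))
    edges = sorted(
        (dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    tree = []
    for weight, i, j in edges:
        root_i, root_j = _find(parent, i), _find(parent, j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((weight, i, j))
    return tree


def mst_clusters(points, k):
    """Split points into k clusters by dropping the k - 1 longest MST edges."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    tree = sorted(mst_edges(points))   # n - 1 edges, ascending by weight
    kept = tree[: n - k]               # keeping n - k edges leaves k components
    parent = list(range(n))
    for _, i, j in kept:
        parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for idx in range(n):
        groups.setdefault(_find(parent, idx), []).append(idx)
    return sorted(groups.values())


if __name__ == "__main__":
    # Hypothetical toy data: two visibly separated groups of points.
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    print(mst_clusters(toy, 2))
```

In this sketch the number of clusters is supplied by the caller. The abstract instead describes selecting the clustering through an optimization index, so a closer reproduction would score each candidate cut with such an index rather than fixing `k` in advance.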
Keywords/Search Tags: Granular computing, Minimum spanning tree classification algorithm, LR-RF algorithm, Differentially expressed genes, Gene interaction network