
Research On Gene Selection Based On Max-Relevance And Min-Redundancy Feature Selection Algorithm

Posted on: 2017-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: D S Zhou
GTID: 2428330488979873
Subject: Software engineering
Abstract/Summary:
Gene expression data are often used in the recognition and diagnosis of cancer. With the rapid development of bioinformatics, DNA chips, an advanced high-throughput detection technology that produces this kind of data, have been widely applied to disease diagnosis and drug screening. The large volumes of data generated by DNA chip technology are typically high-dimensional, noisy, and full of redundant features. In addition, most genes play no role in disease classification, and the number of samples is usually insufficient. Selecting feature genes not only helps identify the genes that best distinguish the target samples but also significantly reduces time and space complexity. Feature selection is therefore of great importance. This thesis studies the problems that arise in feature selection for gene expression data. The main research work includes the following aspects:

(1) Traditional feature selection algorithms often consider only the role of individual genes and neglect the interactions among them, so the final feature subset contains more redundant data, which lowers classification accuracy. This thesis proposes a Max-Relevance and Min-Redundancy algorithm based on a weighted gene co-expression network (NmRMR) and applies it to the recognition and diagnosis of cancer. First, the method computes the correlation between each gene and every other gene and uses it to build a topological dissimilarity measure; from this, a weighted network is constructed and divided into several modules. Second, the gene module most relevant to the sample class is identified by calculating the correlation between each module's eigenvalue and the target class. Finally, the optimal gene subset is obtained by applying the mRMR algorithm within this module. Three public gene expression datasets are used to validate the proposed NmRMR method, and its classification and prediction accuracy is verified with two classifiers, a decision tree (DT) and a support vector machine (SVM). The experimental results show that the NmRMR algorithm effectively improves classification accuracy and reduces redundancy.

(2) This thesis proposes a second feature selection method, a Max-Relevance and Min-Redundancy algorithm based on analogous groups (AGmRMR), which aims to improve the classification accuracy of the model by constructing analogous groups. The AGmRMR algorithm ranks the modules by their relevance to the disease according to a module importance index. At the same time, the connection degree of each gene is calculated and the representative gene set of each module is obtained; analogous groups are then built iteratively, and the optimal feature set is selected using mRMR. Finally, the feature subset most closely associated with the disease is obtained. The experimental results show that AGmRMR not only compensates for the limitation of relying on a single information module but also effectively improves the classification accuracy of feature selection.
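The mRMR step that both methods rely on can be illustrated with a short sketch. The abstract does not state which mRMR criterion or relevance measure the thesis uses, so the code below assumes the common difference form (relevance minus mean redundancy) with mutual information computed on already discretized expression values. The names `mutual_information` and `mrmr_select` are hypothetical and chosen only for this illustration; this is not the thesis's NmRMR or AGmRMR implementation, and the network construction and module selection steps are not shown.

```python
import numpy as np


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    # Joint probability table estimated from co-occurrence counts.
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of y, shape (1, ny)
    nz = joint > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def mrmr_select(X, y, k):
    """Greedy mRMR selection (assumed difference form).

    X: (n_samples, n_genes) array of discretized expression levels.
    y: (n_samples,) array of class labels.
    Returns the indices of up to k selected genes, in selection order.
    """
    X = np.asarray(X)
    n_genes = X.shape[1]
    k = min(k, n_genes)
    if k <= 0:
        return []
    # Relevance: mutual information between each gene and the class label.
    relevance = np.array(
        [mutual_information(X[:, j], y) for j in range(n_genes)]
    )
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(n_genes)
    while len(selected) < k:
        last = selected[-1]
        # Accumulate redundancy with the most recently selected gene.
        redundancy_sum += np.array(
            [mutual_information(X[:, j], X[:, last]) for j in range(n_genes)]
        )
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf           # never pick the same gene twice
        selected.append(int(np.argmax(score)))
    return selected
```

In the workflow the abstract describes, `X` would hold only the genes of the module (or analogous group) chosen in the earlier steps, and the returned indices would be passed to a classifier such as a decision tree or an SVM. A duplicated gene scores high on relevance but is penalized by its redundancy with the copy already selected, which is the behavior the abstract attributes to mRMR.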
Keywords/Search Tags:Weighted co-expression network, Gene expression data, Feature selection, Analogous group, mRMR