
Study of Data Mining Methods for Gene Expression Analysis

Posted on: 2009-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Gu
Full Text: PDF
GTID: 2120360245951261
Subject: Computer application technology
Abstract/Summary:
DNA chip technology has developed rapidly in recent years, and large volumes of DNA sequence data are produced every day. Important physiological and medical information is hidden in these massive biological databases. Effective methods for finding the feature genes that drive classification, and thereby identifying pathogenic genes accurately, would benefit the diagnosis and treatment of disease. Gene expression data are large in volume, high-dimensional, small in sample size, and non-linear. These characteristics raise the computational and time complexity of classifiers and can make the final classification results inaccurate.

We study three commonly used gene expression datasets: Leukemia, Colon, and Prostate. The work focuses on classifying gene expression data with support vector machines (SVM) and on three feature selection algorithms: χ² statistics, information gain, and the SVM normal algorithm. Classification performance is evaluated by independent testing and by 10-fold cross-validation. The goal is to provide a technical basis for rapid, accurate diagnosis and for identifying the types of gene expression data. The main research work and conclusions are as follows.

(1) To choose a classifier for the analysis, we compare the features and principles of SVM, k-NN, decision trees, Bayesian classifiers, and neural networks, and apply all five classification algorithms to the three gene expression datasets. The results show that SVM is more accurate than the other algorithms, and that the linear kernel performs much better than the other three kernels. In the independent testing experiment, the classification accuracy of the linear SVM reaches 97.1%, 87.1%, and 100% on the three datasets, respectively. In the 10-fold cross-validation experiment, it reaches 97.4%, 96.8%, and 98.5%. These results provide a basis for choosing and optimizing the classification model.

(2) To make feature gene selection more effective, we use the signal-to-noise ratio, t statistics, χ² statistics, information gain, and the SVM normal algorithm to select feature subsets, and compare the resulting classification performance of the linear SVM. Independent testing shows that the SVM normal algorithm, information gain, and χ² statistics are more accurate and more stable than the signal-to-noise ratio and t statistics. With a 10-feature subset, the recognition rate reaches 97.1%, 87.1%, and 97.1% on the three datasets. The 10-fold cross-validation results indicate that the SVM normal algorithm performs much better than the other algorithms: with subsets of 10, 17, and 13 features on the three datasets, respectively, accuracy reaches 100%. In short, the most effective approach relies primarily on the SVM normal algorithm, with χ² statistics and information gain as supplements.

(3) We implemented the five feature selection algorithms (signal-to-noise ratio, t statistics, χ² statistics, information gain, and SVM normal) in Java on the Eclipse platform, and integrated feature selection with the SVM classification algorithm. The resulting system selects feature subsets at successive sizes, then classifies and analyzes each one in the feature space. The final results are displayed as curves using the JFreeChart component.
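The filter-style rankings mentioned above (signal-to-noise ratio and t statistics) can be illustrated with a minimal sketch. The abstract does not give the formulas, so this assumes the commonly used definitions: the signal-to-noise ratio as the difference of class means divided by the sum of class standard deviations, and the unequal-variance (Welch) form of the t statistic. The thesis implementation was written in Java; the Python below is only an illustration, and the function names and toy data are hypothetical.

```python
import math
from statistics import mean, stdev


def signal_to_noise(pos, neg):
    """Assumed definition: (mean_pos - mean_neg) / (std_pos + std_neg)."""
    denom = stdev(pos) + stdev(neg)
    return 0.0 if denom == 0 else (mean(pos) - mean(neg)) / denom


def welch_t(pos, neg):
    """Assumed definition: two-sample t statistic with unequal variances."""
    denom = math.sqrt(stdev(pos) ** 2 / len(pos) + stdev(neg) ** 2 / len(neg))
    return 0.0 if denom == 0 else (mean(pos) - mean(neg)) / denom


def rank_genes(expression, labels, score, top_k):
    """Return indices of the top_k genes by absolute score.

    expression: list of samples, each a list of per-gene expression values.
    labels: one binary class label (0 or 1) per sample.
    score: a function taking the two per-class value lists for one gene.
    """
    n_genes = len(expression[0])
    scored = []
    for g in range(n_genes):
        pos = [row[g] for row, y in zip(expression, labels) if y == 1]
        neg = [row[g] for row, y in zip(expression, labels) if y == 0]
        scored.append((abs(score(pos, neg)), g))
    # Highest absolute score first; ties broken by gene index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [g for _, g in scored[:top_k]]


# Hypothetical toy data: 6 samples x 3 genes. Gene 0 separates the classes.
toy_expression = [
    [5.0, 1.0, 3.0],
    [6.0, 1.2, 2.0],
    [5.5, 0.9, 4.0],
    [1.0, 1.1, 3.5],
    [1.5, 1.0, 2.5],
    [0.5, 1.3, 3.0],
]
toy_labels = [1, 1, 1, 0, 0, 0]

top_by_snr = rank_genes(toy_expression, toy_labels, signal_to_noise, 2)
top_by_t = rank_genes(toy_expression, toy_labels, welch_t, 2)
```

In a pipeline like the one described, the selected gene indices would then be used to reduce each sample to a smaller feature subset before training the linear SVM. Each gene is scored independently here, which is what distinguishes these filter methods from a classifier-driven method such as the SVM normal algorithm.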
Keywords/Search Tags: gene expression profiles, data mining, feature selection, support vector machine, SVM normal algorithm