Analysis Of High-throughput Gene Based On The Improved Relief And SVM Algorithm

Posted on: 2013-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: K Zhu
Full Text: PDF
GTID: 2248330374462877
Subject: Biological Information Science and Technology
Abstract/Summary:
In current biological research, gene chips play an important role in gene sequencing technology and can produce a large amount of gene microarray data in a short period. Such data are high-dimensional and contain correlated, redundant, and noisy features. The problem is how to extract and classify them quickly, efficiently, and accurately, so that a small amount of feature information is enough to determine which known genes a sample most resembles. Solving this problem would make it possible to explore unknown genes accurately. The work requires feature selection algorithms from the field of bioinformatics to perform data classification and dimensionality reduction. However, because of correlated, conflicting, and irrelevant genes in the data, a single feature selection algorithm usually yields redundant and erroneous results. If such flawed results are used to guide classification, the outcome may have serious consequences, such as wrong decisions or conclusions with no practical value. This thesis aims to reduce the error rate of feature selection and to improve its computational efficiency. It covers the following four aspects:

1. For Relief, one of the traditional feature selection algorithms, this thesis provides a program implementation on specific gene microarray data, following the algorithm's principle and mathematical interpretation. This combines theory with practice and gives an overview of the algorithm. It also makes clear the advantages and disadvantages of this feature selection method and the points that need attention in specific applications.

2. This thesis presents a new feature selection algorithm that improves the traditional Relief algorithm by working on the region that Relief misclassifies. It combines a linear classifier, the sequential forward selection algorithm, and ReliefF into a new combinatorial optimization algorithm. When the error rate of the selected feature set is high, the algorithm supplements it with representative features drawn from the misclassified sample set, which effectively reduces the error rate of feature selection. To measure how much this improves on ReliefF, the dimensionality reduction performance of the two algorithms is compared.

3. To reduce the redundancy of the microarray data effectively and keep the dimension of the final data set low, the original sample data are reduced twice. Kernel principal component analysis (KPCA) is applied as a second reduction step to the data set already reduced by ReliefF. To overcome the interference of multiply correlated data, a clustering analysis algorithm is added for data preprocessing. The preprocessed data keep the final reduction result accurate and efficient while lowering the computational complexity. The complexity and accuracy of the different feature selection algorithms are then tested experimentally.

4. Finally, the complete combinatorial optimization algorithm constructed here and a variety of comparison algorithms are applied to the same gene microarray data sets. The comparison shows that the combinatorial optimization algorithm needs fewer iterations and achieves higher accuracy than any single feature selection algorithm. It also improves both the dimensionality reduction effect and the computational efficiency compared with the incomplete combination algorithms.

These results indicate that the combinatorial optimization algorithm has a clear advantage when used to reduce the dimension of high-throughput genomic data sets. However, in parameter selection, model structure, and other aspects, the algorithm still needs substantial improvement.
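
To illustrate the Relief weight update that the first aspect above implements, the following is a minimal sketch of the basic two-class Relief algorithm. It is not the thesis's own program or its improved variant. The function name `relief`, its parameters, the range-scaled Manhattan distance, and the toy data are all assumptions made for illustration.

```python
import random


def relief(X, y, n_iter=None, seed=0):
    """Basic two-class Relief (illustrative sketch, not the thesis's code).

    X is a list of numeric feature vectors, y the class labels.
    Returns one weight per feature; larger means more relevant.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    n_iter = n_iter or n

    # Range of each feature, used to scale differences into [0, 1].
    span = []
    for j in range(d):
        col = [row[j] for row in X]
        span.append((max(col) - min(col)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    w = [0.0] * d
    for _ in range(n_iter):
        i = rng.randrange(n)
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue
        # Nearest neighbour of the same class and of the other class.
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        # Reward features that differ on the miss, penalise those
        # that differ on the hit.
        for j in range(d):
            w[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n_iter
    return w


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1],
     [0.9, 0.2], [1.0, 0.8], [0.8, 0.4]]
y = [0, 0, 0, 1, 1, 1]
weights = relief(X, y, n_iter=20)
print(weights)
```

On this toy data the weight of feature 0 comes out positive and the weight of feature 1 negative, so ranking features by weight and keeping the top ones gives the dimensionality reduction step. ReliefF extends this idea to several nearest neighbours and to more than two classes.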
Keywords/Search Tags: Feature Selection, Relief, KPCA, SVM, Dimensionality Reduction