
Study On Attributes Reduction Of Gene Signals Based On The Rough Set Theory

Posted on: 2016-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhang
Full Text: PDF
GTID: 2298330467998656
Subject: Control theory and control engineering
Abstract/Summary:
Since it was identified as a numerical feature that can characterize a DNA sequence, k-mer frequency has been widely used in DNA sequence recognition and classification because it is strongly conserved and representative. Its dimension is 4^k, and the higher the dimension, the greater the amount of computation and the lower the efficiency. Slow classification and low efficiency are therefore problems in urgent need of a solution. The attribute reduction method of rough set theory is an efficient way to reduce the number of features: it removes redundant features and selects good ones as the feature labels used for classification and identification. However, rough set theory can only handle discrete data, while in practice most features, including k-mer frequency, are continuous. How to discretize these continuous features is therefore a problem that must be solved before rough set theory can be applied. This work addresses these two problems in the following three parts.
(1) A discretization algorithm for continuous features is proposed, based on the maximum between-class variance (OTSU). First, the range of each continuous feature is divided into L levels (L > 1). Then the variance between these intervals is computed. Finally, the continuous features are discretized at the breakpoints corresponding to the maximum between-class variance. Experiments on 6 continuous datasets from the UCI database show that the L value giving the best classification accuracy within the selected range can be regarded as the optimal L value.
(2) The discretized k-mer frequencies are reduced based on rough set attribute importance. The attribute importance function of rough set theory is used to reduce the discretized k-mer frequencies and to select good features as the species label for each species. An experiment on 30 bacteria from the National Center for Biotechnology Information (NCBI) shows that the attribute reduction method decreases run time and improves efficiency while classification accuracy stays almost constant or even rises.
(3) Sequences for 145 groups of bacteria were downloaded from NCBI, and the 4-mer and 5-mer frequencies (k = 4, 5) were extracted as the feature vector of each sequence. The k-mer frequencies were discretized with the OTSU-based method and then reduced. The remaining features were classified with a KNN classifier at seven classification levels. Classification accuracy was used as the indicator to evaluate the method, which was compared with the results obtained without reduction and with other discretization and reduction methods. The experimental results show that reducing gene features effectively filters out the salient characteristics and yields better classification results, improving classification efficiency. At the same time, the OTSU-based discretization algorithm for continuous features effectively improves the discretization results.
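The pipeline described above (k-mer frequency extraction, OTSU-style discretization, rough set reduction) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. All function names and the default `levels=16` are hypothetical. The discretization is simplified to a single breakpoint per feature, and the reduction is assumed to be a greedy forward selection driven by the standard rough set dependency degree, which may differ from the attribute importance function actually used.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k DNA k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)  # windows with non-ACGT symbols are ignored
    return [counts[m] / total if total else 0.0 for m in kmers]


def otsu_cut(values, levels=16):
    """Breakpoint maximising the between-class variance over `levels` equal-width bins.

    Simplification (assumption): one breakpoint per feature, i.e. a binary split.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / levels
    hist = [0] * levels
    for v in values:
        hist[min(int((v - lo) / width), levels - 1)] += 1
    n = len(values)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels - 1):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = n - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (total_sum - sum0) / w1
        var = (w0 / n) * (w1 / n) * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return lo + (best_t + 1) * width


def discretize(column, cut):
    """Map a continuous column to 0 (below the cut) or 1 (at or above it)."""
    return [0 if v < cut else 1 for v in column]


def dependency(rows, attrs, labels):
    """Fraction of objects whose equivalence class on `attrs` carries a single label."""
    classes = {}
    for row, y in zip(rows, labels):
        classes.setdefault(tuple(row[a] for a in attrs), []).append(y)
    positive = sum(len(ys) for ys in classes.values() if len(set(ys)) == 1)
    return positive / len(rows)


def greedy_reduct(rows, labels):
    """Add the attribute with the largest dependency gain until the dependency
    of the full attribute set is reached (assumed greedy strategy)."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, all_attrs, labels)
    chosen = []
    while dependency(rows, chosen, labels) < target:
        base = dependency(rows, chosen, labels)
        best = max((a for a in all_attrs if a not in chosen),
                   key=lambda a: dependency(rows, chosen + [a], labels) - base)
        chosen.append(best)
    return chosen
```

In use, each sequence would be turned into a 4^k-dimensional vector with `kmer_frequencies`, each feature column discretized with its own `otsu_cut`, and the attribute indices returned by `greedy_reduct` kept as input to a classifier such as KNN.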
Keywords/Search Tags: Rough set, discretization, k-mer frequency, attribute reduction, maximum between-class variance (OTSU)