Research On Attribute Selection Algorithm Based On Analysis Of Correlation Between Attributes

Posted on: 2010-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: J Z Shao
GTID: 2178360275473280
Subject: Computer Science and Technology
Abstract/Summary:
Data mining is a technology that extracts latent, useful information from large volumes of daily transactional data. Data mining algorithms often place strict requirements on the data set, such as good integrity, little redundancy, and weak correlation between attributes. Daily transactional data, however, may be incomplete, redundant, or ambiguous, so raw data usually must be preprocessed before data mining algorithms are applied. Attribute selection is an important preprocessing method: it reduces the noise in a data set and makes data mining algorithms more effective.
In this thesis, we first introduce the theory of attribute selection and the basic concepts of information theory. We then analyze in detail the static organizational structure and dynamic running process of the algorithms in the attribute selection package. Next, we review the existing correlation-based evaluation methods and describe in depth the new analysis of redundancy between attributes and the max-relevance and min-redundancy evaluation criterion. Finally, we design two novel attribute selection algorithms based on the analysis of correlation between attributes. The first is an attribute redundancy removal algorithm, which uses decision independent correlation and decision dependent correlation to measure, respectively, the relevance between an attribute and the class attribute and the redundancy between one attribute and another. The second is a rank-wrapper algorithm, a two-stage approach: a ranking stage first uses the max-relevance and min-redundancy criterion to select several good candidate attribute subsets, and a wrapper stage then uses cross-validation to select the best of them. The classification algorithms Naive Bayes and C4.5 are used to evaluate the attribute selection results.
Experiments show that on most data sets these attribute selection algorithms select attributes effectively while maintaining classification performance.
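The max-relevance and min-redundancy criterion mentioned in the abstract can be illustrated with a minimal sketch. It assumes discrete attributes and uses mutual information as the correlation measure; this is an assumption, not the thesis's own decision independent or decision dependent correlation. The score "relevance minus mean redundancy" is one common variant of the criterion, and the names `entropy`, `mutual_information`, and `mrmr_rank`, as well as the toy data, are hypothetical. Only the ranking stage is sketched; the cross-validated wrapper stage is not.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two equal-length discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def mrmr_rank(columns, labels):
    """Greedily rank attributes by relevance minus mean redundancy.

    columns: dict mapping attribute name -> list of discrete values
    labels:  list of class values, one per row
    Illustrative sketch only, not the algorithm from the thesis.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in columns.items()}
    selected, remaining = [], list(columns)

    def score(name):
        if not selected:
            return relevance[name]
        # Mean mutual information with the attributes already chosen.
        redundancy = sum(mutual_information(columns[name], columns[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up for illustration): the class is `a AND b`,
# `a_copy` duplicates `a`, and `noise` is unrelated to the class.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
columns = {"a": a, "a_copy": list(a), "b": b,
           "noise": [0, 1, 0, 1, 0, 1, 0, 1]}
labels = [x & y for x, y in zip(a, b)]

print(mrmr_rank(columns, labels))
```

On this toy data the duplicate attribute `a_copy` is ranked last even though it is as relevant as `a`, because its redundancy with the already selected `a` outweighs its relevance. A wrapper stage in the spirit of the abstract would then evaluate prefixes of this ranking with a classifier under cross-validation.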
Keywords/Search Tags: Data mining, Attribute selection, Information theory, Weka, Correlation