
Research And Application Of Discretization Algorithm

Posted on: 2010-10-28
Degree: Master
Type: Thesis
Country: China
Candidate: Z Wang
GTID: 2178360302961334
Subject: Computer application technology
Abstract/Summary:
Discretization plays an important part in data mining and knowledge discovery, and its performance has a direct impact on the accuracy and efficiency of machine learning. Most machine learning tools are designed for datasets with only discrete attribute values; however, real-world datasets often contain continuous attribute values (such as temperature or height). As a result, learning on such data often fails to reach a satisfactory accuracy. It is therefore necessary to apply a discretization algorithm to preprocess the datasets prior to data mining.

This thesis analyzes most of the existing discretization algorithms and compares them in terms of time complexity, accuracy and efficiency. We then choose the CAIM algorithm for further improvement. CAIM is a global, static, top-down, supervised discretization algorithm. Although it has lower time complexity and higher accuracy and efficiency than other discretization algorithms, it still has three deficiencies: first, it ignores the importance of attributes during discretization; second, it does not take the uncertainty of the decision table into account; third, using the CAIM value as the discretization criterion is inappropriate. These shortcomings cause a loss of information and thus reduce the accuracy of machine learning. To address these three shortcomings, we present two improved algorithms.

First, we propose an improved CAIM discretization algorithm that addresses the first two shortcomings. The improved algorithm measures the importance of each attribute based on difference-similitude set theory (DSST) and considers the uncertainty of the decision table during further discretization. With both C4.5 and the support vector machine as classifiers, the improved algorithm achieves a higher recognition rate.

Second, the CAIM algorithm produces too few cut points to achieve a high recognition rate, so we propose a novel algorithm called λ-CAIM, based on class-attribute interdependence. The λ-CAIM algorithm uses the λ contingency coefficient, which is commonly used in statistics, as the discretization criterion, avoiding the deficiency of the CAIM criterion. The results show that the λ-CAIM algorithm obtains a higher recognition rate in classification.
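
For readers unfamiliar with the baseline, the sketch below illustrates the kind of criterion that the original CAIM algorithm maximizes when it picks cut points. It is not taken from this thesis and does not implement either of the improved algorithms. It assumes the commonly cited form of the CAIM criterion, namely the average over intervals of max_r^2 / M_r, where max_r is the count of the most frequent class in interval r and M_r is the number of samples in that interval. The function names `discretize` and `caim_value` are hypothetical, and skipping empty intervals is a choice made for this sketch.

```python
from bisect import bisect_right
from collections import Counter


def discretize(values, cut_points):
    """Map each continuous value to the index of the interval it falls in.

    cut_points are the interior boundaries, sorted ascending. A value equal
    to a cut point is placed in the interval to its right.
    """
    return [bisect_right(cut_points, v) for v in values]


def caim_value(values, labels, cut_points):
    """Assumed CAIM criterion: mean over intervals of max_r**2 / M_r."""
    n_intervals = len(cut_points) + 1
    # One class-frequency counter per interval (a row-wise quanta matrix).
    counts = [Counter() for _ in range(n_intervals)]
    for interval, label in zip(discretize(values, cut_points), labels):
        counts[interval][label] += 1

    total = 0.0
    for class_counts in counts:
        size = sum(class_counts.values())
        if size:  # sketch-level choice: empty intervals contribute nothing
            total += max(class_counts.values()) ** 2 / size
    return total / n_intervals
```

With six hypothetical temperature readings, three labelled "healthy" below 37.0 and three labelled "fever" above it, a cut at 37.0 separates the classes cleanly and scores higher than a cut at 36.6 or no cut at all. A top-down procedure in this style would keep adding the candidate boundary that raises the criterion most.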
Keywords/Search Tags: Discretization, CAIM algorithm, Difference-similitude set theory, Uncertainty, λ contingency coefficient