Research On Discretization Methods For Continuous Data

Posted on: 2013-01-18; Degree: Doctor; Type: Dissertation
Country: China; Candidate: Y Sang; Full Text: PDF
GTID: 1118330371496666; Subject: Computer application technology
Abstract/Summary:
With the explosive growth of data volumes and the rapid development of information technology, data mining and machine learning have become active research areas. Much real-world data has continuous attribute values, yet many classification algorithms in data mining and machine learning apply only to data with discrete attribute values. Continuous attributes must therefore be discretized before these algorithms can work properly. To address this problem, we systematically analyze existing discretization methods for continuous data and study them in depth from several aspects, such as the discretization criterion. The main contributions of this dissertation are as follows:

(1) A bottom-up discretization method that combines single-attribute and multi-attribute information is proposed. It considers the correlations among attributes and also evaluates the variance among adjacent interval pairs, with the aim of finding the best intervals to merge. First, we propose a combined single-attribute and multi-attribute discretization criterion, derived from the minimum description length principle and the significance of adjacent interval pairs among continuous attributes, and we analyze the advantages of this criterion. Next, we develop a heuristic bottom-up discretization algorithm that searches for the optimal discretization result under the criterion. Finally, experiments on UCI data sets show that the proposed method significantly improves the learning accuracy of the C4.5 decision tree and the support vector machine classifier compared with existing discretization methods.

(2) A discretization method for high-dimensional data, based on a nonlinear dimension reduction technique, is proposed. First, we propose a locally linear embedding algorithm based on local neighborhood optimization. It maps high-dimensional data into a low-dimensional space while preserving the geometric correlation structure of the original data, overcoming the deficiency that this structure is easily distorted during mapping. Second, we propose an area-based chi-square discretization algorithm. It discretizes each continuous attribute in the low-dimensional space by considering, from a probabilistic point of view, how likely each interval pair is to be merged. The experimental results show that the proposed method yields a better discretization result and more concise knowledge of the data, and that it improves the learning accuracy of classifiers. In addition, the proposed discretization method has been applied to computer vision and image classification, where it achieves good results.

(3) A data discretization method based on an improved chi-square statistic is proposed. It improves the quality of discretization methods that rely on statistical independence. First, we analyze the deficiency in how the degrees of freedom of the chi-square function are selected and give a modified selection scheme. Second, we propose an improved scheme for the expected frequency that follows the data distribution, which overcomes the deficiency that different data sets share the same expected frequency and thereby improves the accuracy of the chi-square calculation. The experimental results show that the improved method produces a higher class-attribute interdependence redundancy value and significantly improves the learning accuracy of the C4.5 decision tree and the naive Bayes classifier.
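For readers unfamiliar with bottom-up, chi-square-based discretization, the following is a minimal sketch of the classic ChiMerge-style baseline that this family of methods builds on. It is not the dissertation's algorithm: it uses the plain Pearson chi-square statistic, without the combined MDL criterion, the area-based variant, or the modified degrees of freedom and expected frequencies described in the abstract. The function names `chi2_adjacent` and `chimerge` and the `max_intervals` stopping rule are illustrative assumptions.

```python
from collections import Counter


def chi2_adjacent(a, b, classes):
    """Pearson chi-square statistic for two adjacent intervals.

    a, b: Counter mapping class label -> number of samples in the interval.
    """
    n_a, n_b = sum(a.values()), sum(b.values())
    total = n_a + n_b
    chi2 = 0.0
    for row, n_row in ((a, n_a), (b, n_b)):
        for c in classes:
            col = a[c] + b[c]                # class count over both intervals
            expected = n_row * col / total   # expected count under independence
            if expected > 0:                 # skip classes absent from both intervals
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2


def chimerge(values, labels, max_intervals):
    """Merge the most similar adjacent intervals until max_intervals remain.

    Returns the cut points (lower bounds of every interval except the first).
    Assumption: a fixed interval count is the stopping rule; the classic
    method usually stops on a chi-square significance threshold instead.
    """
    classes = sorted(set(labels))
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, Counter())[y] += 1
    lows = sorted(counts)                    # start with one interval per distinct value
    intervals = [counts[v] for v in lows]
    while len(intervals) > max_intervals:
        scores = [chi2_adjacent(intervals[i], intervals[i + 1], classes)
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        intervals[i] = intervals[i] + intervals[i + 1]   # merge the closest pair
        del intervals[i + 1]
        del lows[i + 1]
    return lows[1:]


if __name__ == "__main__":
    print(chimerge([1, 2, 3, 4, 5, 6], list("aaabbb"), max_intervals=2))
```

A low chi-square value means the two intervals have similar class distributions, so merging them loses little class information; the improvements summarized in contribution (3) change how that statistic is computed rather than the overall merge loop.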
Keywords/Search Tags: Discretization of Continuous Data, Minimum Description Length Principle, High-Dimensional Data, Dimension Reduction