
Unsupervised Discretization Algorithm Based On Ensemble And Application On The Similarity Of Data Set

Posted on: 2016-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Xu
GTID: 2308330476952167
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, pattern recognition is widely used in real life. However, some data mining algorithms can only handle discrete values, while many real-world data sets consist of continuous values, which directly affects the effectiveness of machine learning. Depending on whether the label information of the data set is used, discretization methods can be divided into supervised and unsupervised methods. Supervised discretization has been studied in depth and yields accurate results, while unsupervised discretization remains challenging. In this thesis, an unsupervised discretization method based on ensemble learning is proposed and applied to the analysis of clustering algorithm selection.

The main idea of the ensemble-based unsupervised discretization algorithm is as follows. First, the k-means method is employed to partition the data set into multiple subgroups to obtain label information. A supervised discretization algorithm is then applied to the labeled data set. Repeating these two steps yields multiple discretization results. Ensemble learning is then applied to combine these results into a set of minimum sub-intervals. Finally, the minimum sub-intervals are merged according to the similarity between neighboring data, and two effective stopping criteria are proposed to end the merging process. Because the merging process takes the neighborhood relation into account, the intrinsic structure of the data set is maintained. To verify the accuracy of the algorithm, the discretized data are passed to a clustering algorithm such as spectral clustering, and the quality of the resulting clustering is evaluated. The experimental results demonstrate the feasibility and effectiveness of the proposed method.
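The discretization pipeline described above can be illustrated with a minimal Python sketch for a single continuous attribute. It is not the thesis's implementation: the abstract names neither the supervised discretizer, the similarity measure, nor the two stopping criteria. As labeled assumptions, the sketch cuts wherever the k-means pseudo-label changes between neighboring values (`boundary_cuts`), measures similarity by the distance between the means of adjacent intervals, and stops at a hypothetical `max_bins` limit. All function and parameter names are illustrative.

```python
import bisect
import random


def kmeans_1d(values, k, rng, iters=20):
    """Lloyd's k-means on a single attribute; returns one cluster label per value."""
    centers = rng.sample(values, k)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = sum(members) / len(members)
    return labels


def boundary_cuts(values, labels):
    """Assumed stand-in for the supervised step: cut midway between
    neighboring values whose pseudo-labels differ."""
    pairs = sorted(zip(values, labels))
    return {(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb and a != b}


def ensemble_cuts(values, ks=(2, 3, 4), runs=5, seed=0):
    """Union of the cut points from repeated runs, i.e. the finest sub-intervals."""
    rng = random.Random(seed)
    cuts = set()
    for _ in range(runs):
        for k in ks:
            cuts |= boundary_cuts(values, kmeans_1d(values, k, rng))
    return sorted(cuts)


def merge_intervals(values, cuts, max_bins):
    """Merge the adjacent pair of intervals with the closest means until
    at most max_bins intervals remain (assumed stopping criterion)."""
    bins = {}
    for v in values:
        bins.setdefault(bisect.bisect_right(cuts, v), []).append(v)
    groups = [bins[i] for i in sorted(bins)]  # non-empty intervals, left to right
    while len(groups) > max_bins:
        means = [sum(g) / len(g) for g in groups]
        i = min(range(len(groups) - 1), key=lambda j: means[j + 1] - means[j])
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    # Final cut points lie midway between neighboring intervals.
    return [(max(a) + min(b)) / 2 for a, b in zip(groups, groups[1:])]
```

For a list of floats `xs`, `merge_intervals(xs, ensemble_cuts(xs), max_bins=3)` returns the final cut points. Only adjacent intervals are ever merged, which is one simple way to respect the neighborhood relation mentioned in the abstract.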
The proposed method's clustering accuracy is about 33% higher on average than that of the four other methods compared.

Clustering analysis is an important tool in data mining and related areas. Because clustering is an ill-posed problem, many clustering algorithms exist in the literature. In general, however, a given clustering algorithm suits only a few data sets and requires user-specified parameters, while users have no prior knowledge about their data. Choosing a suitable clustering algorithm is therefore a difficult problem. To address this problem to some degree, we define a similarity measure between data sets using the unsupervised discretization scheme described above and propose a framework for selecting clustering algorithms. The main idea can be summarized as follows. First, a space of classical clustering algorithms and a space of typical data sets are established, and a mapping between them is created. Then the data sets are binarized (with the proposed unsupervised discretization method), and their clustering-oriented stability is analyzed. The similarity of the feature vectors is calculated, and the k nearest data sets of the given data set are identified. Finally, the clustering algorithms associated with the nearest neighbors are recommended for the given data set. Seven typical clustering algorithms are selected for extensive experiments. The results show that the recommended algorithms cluster the given data sets accurately and that the framework is effective.
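The recommendation step can be sketched in Python as a nearest-neighbor lookup over binary feature vectors. This is an illustration under stated assumptions, not the thesis's method: the abstract does not name the similarity measure, so Jaccard similarity is used as a stand-in, and the knowledge-base layout (`features`, `best_algorithm`) and the majority vote are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two equal-length 0/1 vectors (assumed measure)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def recommend(query, knowledge_base, k=3):
    """Rank algorithms by votes from the k reference data sets most similar to query.

    knowledge_base is a list of dicts with a binary 'features' vector and the
    'best_algorithm' recorded for that reference data set (hypothetical layout).
    """
    ranked = sorted(knowledge_base,
                    key=lambda entry: jaccard(query, entry["features"]),
                    reverse=True)
    votes = Counter(entry["best_algorithm"] for entry in ranked[:k])
    return [name for name, _ in votes.most_common()]
```

The first element of the returned list is the algorithm favored by most of the k nearest reference data sets. Voting is only one possible design; returning the algorithm of the single nearest neighbor would fit the description equally well.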
Keywords/Search Tags: ensemble learning, categorical data, similarity measurement, stability, binarization, clustering algorithm recommendation