Research On Clustering Methods And Their Applications

Posted on: 2011-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: H Chen
GTID: 2198330332981233
Subject: Computer software and theory
Abstract/Summary:
Clustering is a very active research topic in machine learning and data mining and has attracted considerable effort from researchers. Traditional machine learning methods can be divided into supervised and unsupervised methods, and clustering is a representative unsupervised method. Choosing an appropriate similarity measure is critical to clustering quality. However, the distance-based similarity measures used by most traditional clustering algorithms are not well suited to high-dimensional datasets or to datasets with different types of attributes. There is therefore growing interest in methods that apply to high-dimensional datasets and can handle all types of data without depending on a distance-based similarity measure. Moreover, traditional machine learning methods can handle only labeled data or only unlabeled data, whereas datasets containing both are far more common in the real world, and how to exploit such data has recently become a hot topic. Semi-supervised learning, which uses both labeled and unlabeled data, emerged in response. Many typical clustering methods have been extended to semi-supervised versions, and many experiments have shown that extending clustering to the semi-supervised domain is an effective way to help solve the problems mentioned above.
In this thesis, a novel similarity measure based on the spatial overlapping relation is first proposed to address the limitations of the distance-based similarity measures used by most traditional clustering methods. It calculates the similarity between a pair of data points from the mutual overlapping relation between them in a multi-dimensional space. A hierarchical clustering method using this spatial-overlapping-based similarity measure was then implemented.
Furthermore, the clustering methods were extended to different domains of semi-supervised learning, and three corresponding algorithms were designed and developed. The first is a spatial-overlapping-based semi-supervised feature selection method, applied to high-dimensional datasets with very few labeled data. Second, because traditional supervised classification requires part or all of the data to be labeled manually, which makes it impractical for the ubiquitous large datasets, a novel semi-supervised classification method for large datasets was proposed. Finally, a novel dual-ensemble-based semi-supervised feature selection method was proposed to address the robustness or stability of feature selection techniques, as well as to make use of "cheaper" unlabeled data in several practical applications.
Keywords/Search Tags: Clustering, Spatial Overlapping, Similarity Measurement, Semi-Supervised Classification, Semi-Supervised Feature Selection, Dual Ensemble