Semi-Supervised Text Clustering Based On Feature Weighting

Posted on: 2012-05-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
GTID: 2218330338468489
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computer and information network technology, the volume of social information has grown dramatically, and databases expand daily in both scale and capacity. This growth provides a broad stage for the research and application of cluster analysis. In many practical applications, some prior knowledge about the data is available along with the data itself, yet traditional clustering algorithms do not take this prior knowledge into account. Semi-supervised clustering studies how to use a small amount of supervised information to improve the performance of clustering, which is otherwise an unsupervised learning task. It has emerged in recent years as a new research direction in machine learning and an important branch of data mining, and it is gradually becoming a useful tool in many areas. However, current semi-supervised clustering methods achieve poor accuracy, especially when the small amount of labeled data is insufficient to reflect the complete cluster structure of the large amount of unlabeled data.

This thesis first introduces the background, current state, and significance of semi-supervised clustering research, and briefly reviews commonly used clustering methods, several current feature selection methods, and evaluation criteria. It then focuses on three kinds of semi-supervised clustering algorithms: constraint-based clustering, distance-based clustering, and clustering based on both constraints and distance. The constraint-based K-means algorithm receives particular attention, and the algorithms are briefly described and demonstrated through experiments.

To improve the accuracy of semi-supervised clustering, we modify the constraint-based K-means algorithm by introducing feature weighting, so that documents of the same class become more similar to one another, and we examine experimentally how different values of the feature weighting index affect the algorithm. We verify the algorithm not only on single-language data sets but also on a Chinese-English data set: even when the labeled documents contain only Chinese or only English, the algorithm can cluster the whole multi-language data set. The experimental results show that, in terms of efficiency and accuracy, semi-supervised clustering based on feature weighting performs better than cross-language classification.
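
The abstract does not give the weighting formula, so the following is only a minimal sketch of how feature weighting might be combined with a constraint-based K-means. It rests on several assumptions that are not taken from the thesis: the supervision is a set of labeled seed documents whose labels stay fixed, every cluster has at least one seed, each feature is scored by the share of its seed variance that lies between classes, and the "feature weighting index" is read as an exponent `beta` applied to that score. All function and parameter names are hypothetical.

```python
from statistics import fmean, pvariance


def seed_feature_weights(X, seed_idx, seed_labels, k, beta=1.0, floor=1e-6):
    """Assumed scheme: weight_j is proportional to
    (between-class share of feature j's variance on the seeds) ** beta."""
    weights = []
    for j in range(len(X[0])):
        col = [X[i][j] for i in seed_idx]
        total = pvariance(col)
        within = 0.0
        for c in range(k):
            members = [v for v, lab in zip(col, seed_labels) if lab == c]
            within += len(members) * pvariance(members)
        within /= len(col)
        score = (total - within) / total if total > 0 else 0.0
        # Clamp to [floor, 1] so no weight is exactly zero or negative.
        weights.append(max(floor, min(1.0, score)) ** beta)
    norm = sum(weights)
    return [w / norm for w in weights]


def weighted_sq_dist(a, b, w):
    """Squared Euclidean distance with one weight per feature."""
    return sum(wj * (aj - bj) ** 2 for aj, bj, wj in zip(a, b, w))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [fmean(col) for col in zip(*points)]


def constrained_weighted_kmeans(X, seed_idx, seed_labels, k, beta=1.0, n_iter=50):
    """K-means in which seed points keep their given labels (the constraint)
    and all distances use the seed-derived feature weights."""
    w = seed_feature_weights(X, seed_idx, seed_labels, k, beta)
    fixed = dict(zip(seed_idx, seed_labels))
    # Initialise each centroid from the seeds of its class.
    centers = [centroid([X[i] for i, lab in fixed.items() if lab == c])
               for c in range(k)]
    labels = [-1] * len(X)
    for _ in range(n_iter):
        new_labels = [
            fixed[i] if i in fixed
            else min(range(k), key=lambda c: weighted_sq_dist(x, centers[c], w))
            for i, x in enumerate(X)
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        centers = [centroid([x for x, lab in zip(X, labels) if lab == c])
                   for c in range(k)]
    return labels, centers, w
```

In this sketch `beta = 0` gives every feature the same weight, which reduces to an unweighted seeded K-means, while larger values of `beta` concentrate the weight on features that separate the labeled classes. For text, the rows of `X` would be document vectors such as TF-IDF, which is again an assumption rather than something the abstract states.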
Keywords/Search Tags: Semi-supervised Clustering, Feature Weighting, Multi-language, Text Clustering