Semi-Supervised Text Clustering Based On Feature Weighting

Posted on:2012-05-07

Degree:Master

Type:Thesis

Country:China

Candidate:J Li

GTID:2218330338468489

Subject:Computer Science and Technology

Abstract/Summary:

With the development of computer and information network technology, the volume of information in society has grown dramatically, and databases expand daily in both scale and capacity. This provides a broad stage for the research and application of cluster analysis. In many practical applications, however, some prior knowledge about the data is available along with the data itself, yet traditional clustering algorithms do not take this prior knowledge into account. Semi-supervised clustering studies how to use a small amount of supervised information to improve clustering performance in unsupervised learning. It has emerged in recent years as a new research direction in machine learning and an important branch of data mining, and it is gradually becoming a useful tool in many areas. Current semi-supervised clustering methods nevertheless achieve poor accuracy, especially when the small amount of labeled data is insufficient to reflect the complete cluster structure of the large amount of unlabeled data.

This thesis first introduces the background, current state, and significance of research on semi-supervised clustering, and briefly reviews commonly used clustering methods, several current feature selection methods, and evaluation criteria. It then focuses on three kinds of semi-supervised clustering algorithm: constraint-based clustering, distance-based clustering, and combined constraint- and distance-based clustering. These algorithms, the constraint-based K-means algorithm in particular, are briefly described and demonstrated through experiments.

To improve the accuracy of semi-supervised clustering, the thesis modifies the constraint-based K-means algorithm by introducing feature weighting, so that documents of the same class become more similar, and examines experimentally how different feature weighting indices affect the algorithm. The algorithm is verified on single-language data sets and is also studied on a Chinese-English data set: even when the labeled documents are written in only one language, Chinese or English, it can cluster the whole multi-language data set. The experimental results show that, in terms of efficiency and accuracy, semi-supervised clustering based on feature weighting performs better than cross-language classification.
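The abstract does not give the weighting formula, so the following is only a minimal sketch of how feature weighting might be combined with constraint-based K-means. It rests on assumptions that are not from the thesis: labeled documents act as fixed seeds that initialize the centroids and never change cluster, feature weights are updated from within-cluster dispersion in the style of weighted K-means, and the hypothetical parameter `beta` plays the role of the feature weighting index. The function name `seeded_weighted_kmeans` and all parameters are illustrative.

```python
def weighted_sq_dist(x, c, w, beta):
    """Squared Euclidean distance with each feature scaled by w_j ** beta."""
    return sum((wj ** beta) * (xj - cj) ** 2 for xj, cj, wj in zip(x, c, w))


def mean(points, dim):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(dim)]


def seeded_weighted_kmeans(X, seeds, k, beta=2.0, n_iter=50, eps=1e-9):
    """Sketch of seeded K-means with feature weighting (assumed scheme).

    X     : list of feature vectors (e.g. term weights of documents)
    seeds : dict mapping document index -> known cluster label in 0..k-1;
            assumed to contain at least one seed per cluster
    beta  : weighting exponent, must be > 1 (assumption, see lead-in)
    """
    assert beta > 1.0, "the weight update below is only defined for beta > 1"
    dim = len(X[0])
    w = [1.0 / dim] * dim  # start with uniform feature weights

    # Initialize each centroid from the labeled seeds of that cluster.
    centroids = [
        mean([X[i] for i, lab in seeds.items() if lab == c], dim)
        for c in range(k)
    ]

    labels = None
    for _ in range(n_iter):
        # Assignment step: seeds keep their label, other points go to the
        # nearest centroid under the weighted distance.
        new_labels = [
            seeds[i] if i in seeds else
            min(range(k), key=lambda c: weighted_sq_dist(x, centroids[c], w, beta))
            for i, x in enumerate(X)
        ]
        if new_labels == labels:
            break  # assignments are stable, so centroids and weights are too
        labels = new_labels

        # Centroid step: recompute the mean of each non-empty cluster.
        for c in range(k):
            members = [X[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centroids[c] = mean(members, dim)

        # Weight step: features with small within-cluster dispersion D_j
        # receive large weights; the weights sum to 1.
        D = [
            eps + sum((X[i][j] - centroids[labels[i]][j]) ** 2
                      for i in range(len(X)))
            for j in range(dim)
        ]
        w = [
            1.0 / sum((D[j] / D[t]) ** (1.0 / (beta - 1.0)) for t in range(dim))
            for j in range(dim)
        ]
    return labels, w


# Toy usage: feature 0 separates the two groups, feature 1 is noise.
X = [[0.0, 5.0], [0.2, 1.0], [0.1, 9.0], [5.0, 4.0], [5.2, 8.0], [4.9, 0.5]]
labels, weights = seeded_weighted_kmeans(X, seeds={0: 0, 3: 1}, k=2)
print(labels, weights)
```

In this toy run only two of the six points are labeled, and the discriminative feature ends up with most of the weight, which is one way the similarity within a class could be increased. Real text data would first need a vector representation such as TF-IDF, which the sketch leaves out.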

Keywords/Search Tags:

Semi-supervised Clustering, Feature Weighting, Multi-language, Text Clustering

Related items

1	Semi-supervised Learning On Text Data
2	Research On Clustering Algorithms With Feature Preferences And Their Implementation
3	Research On Text Clustering Based On Semi-supervised Learning
4	Chinese Language Network Statistical Properties Of Semi-supervised Document Clustering Algorithm Research
5	Research On Key Problems In Text Mining
6	A Novel Labels And Similarity Reconstruction Based On K-means Algorithm Application On Text Clustering
7	Research On Semi-supervised Clustering And Classification Algorithm
8	Research On Semi-supervised Deep Text Clustering Method Combined With User Intentio
9	Research On Key Technology Of Clustering Analysis Optimization
10	Research On Clustering Methods And Their Applications