Researching The Kernel Clustering Algorithm And Its Application In Text Clustering

Posted on: 2015-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Xu
GTID: 2298330422488481
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the popularity of the Internet and the continuous improvement of network technology, the Internet has become the world's largest and richest information repository. However, when users query information, they are often lost in the sea of information, which greatly reduces retrieval efficiency.

Text clustering is an effective way to visualize and manage vast amounts of textual information. Because text clustering does not require predefined categories and can group texts automatically, it has been widely used in information retrieval in recent years. Classic clustering methods such as C-means clustering and fuzzy C-means clustering work well only for some typical sample distributions. They cluster directly on the sample features without optimizing those features, so their effectiveness depends largely on the distribution of the samples. When the scatter of the samples is very large or very small, these methods perform relatively poorly.

The main idea of the kernel clustering method is to map the data points of the input space into a high-dimensional feature space through a nonlinear mapping and to cluster in that feature space. Because the inner product of feature vectors in the high-dimensional space can be calculated directly by a kernel function from the input vectors in the low-dimensional space, an appropriate Mercer kernel function is selected in place of the inner product of the nonlinear mapping, so the computation does not increase with the number of dimensions.

In this paper, building on the basic theory of kernel methods and combining it with entropy theory, we propose subspace samples selection based on kernel FCM and maximum entropy fuzzy C-means clustering based on sample weighting and initial cluster centers (WKMEFCM). Finally, this paper applies them to text clustering.
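The kernel substitution described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the degree-2 polynomial kernel, and the sample points are assumptions chosen only because the polynomial kernel has a small explicit feature map to check against. The sketch computes the squared distance between two mapped points using kernel evaluations alone, via ||phi(x) - phi(v)||^2 = K(x, x) - 2 K(x, v) + K(v, v).

```python
import math


def dot(x, y):
    """Ordinary inner product in the input space."""
    return sum(a * b for a, b in zip(x, y))


def poly2_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x . y)^2 (illustrative choice)."""
    return dot(x, y) ** 2


def poly2_feature_map(x):
    """Explicit feature map for poly2_kernel on 2-D inputs (for checking only)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; sigma is a hypothetical bandwidth parameter."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def kernel_sq_distance(x, v, kernel):
    """Squared feature-space distance using kernel evaluations only."""
    return kernel(x, x) - 2.0 * kernel(x, v) + kernel(v, v)


# Hypothetical sample points in a 2-D input space.
x, v = (1.0, 2.0), (3.0, -1.0)

# Distance computed through the explicit mapping ...
explicit = sum(
    (a - b) ** 2 for a, b in zip(poly2_feature_map(x), poly2_feature_map(v))
)
# ... and the same distance computed without ever forming the mapping.
implicit = kernel_sq_distance(x, v, poly2_kernel)
```

For the Gaussian kernel, whose feature space is infinite-dimensional, no explicit map can be written down, yet `kernel_sq_distance(x, v, gaussian_kernel)` still costs only three kernel evaluations. This is the sense in which the computation does not grow with the dimension of the feature space.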
Experiments confirmed that, with the introduction of the Mercer kernel function, features of the original data that were not apparent stand out, so the clustering results improve for text data that is confusingly distributed and highly correlated, and therefore difficult to separate.

Finally, based on the open-source Carrot2, this paper builds a Chinese text clustering Web search system and implements clustering of search results. To account for the characteristics of Chinese, the feature weight calculation considers not only the traditional term frequency and document frequency but also the part of speech and the position of words in the text, which increases the credibility of the weights. The proposed WKMEFCM algorithm is applied to the system, and the assessment shows that the system further improves the efficiency of information retrieval.
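The abstract does not give the weighting formula, so the following is only a hedged sketch of how term frequency, document frequency, part of speech, and position might be combined. The multiplicative form, the function name `term_weight`, and every factor value in the two tables are assumptions made for illustration, not values from the thesis.

```python
import math

# Hypothetical boost factors; the thesis does not state its actual values.
POS_FACTOR = {"noun": 1.5, "verb": 1.2}          # other parts of speech default to 1.0
POSITION_FACTOR = {"title": 2.0, "body": 1.0}    # other positions default to 1.0


def term_weight(tf, df, n_docs, pos, position):
    """TF-IDF weight scaled by assumed part-of-speech and position factors.

    tf       -- occurrences of the term in the document
    df       -- number of documents containing the term (must be > 0)
    n_docs   -- total number of documents in the collection
    pos      -- part-of-speech tag of the term (hypothetical tag set)
    position -- where the term occurs, e.g. "title" or "body"
    """
    idf = math.log(n_docs / df)
    return tf * idf * POS_FACTOR.get(pos, 1.0) * POSITION_FACTOR.get(position, 1.0)


# A noun in the title outweighs an adverb in the body with the same counts.
w_title_noun = term_weight(3, 10, 100, "noun", "title")
w_body_adverb = term_weight(3, 10, 100, "adverb", "body")
```

A multiplicative combination keeps the weight at zero for terms that occur in every document (where the IDF term is zero); an additive scheme would behave differently, and the source does not say which the author used.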
Keywords/Search Tags: Text Clustering, Kernel Function, Subspace Samples Selection, Maximum Entropy Clustering, Feature Weighted