Study Of Text Clustering Based On K-Means Algorithm

Posted on:2009-04-04

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Zheng


GTID:2178360272471072

Subject:Signal and Information Processing

Abstract/Summary:

With the rapid development of the Internet and intranets, the volume of electronic text data of all kinds has increased sharply. How to access, manage, and use these texts quickly and efficiently has become an urgent and important issue in information systems. In recent years, automatic content-based text clustering, one of the basic tools for addressing these problems, has developed rapidly and attracted widespread attention. The goal of text clustering is to divide a document collection into several clusters such that documents within the same cluster are as similar in content as possible, while documents in different clusters are as dissimilar as possible. As an important application of text mining, text clustering has become an active research topic.

This thesis first introduces the background and significance of text mining research and the related basic theory. Second, it analyzes the text preprocessing process, focusing on word segmentation for Chinese text. It adopts the maximum match algorithm for word segmentation, combined with one-word backtracking and a word-frequency-based method to detect and resolve segmentation ambiguity. It discusses feature representation and feature selection for the preprocessed text, representing texts with the Vector Space Model (VSM) and selecting text features with the TF-IDF evaluation function.

Then, for Chinese text clustering, it uses a two-pass ("twice") clustering method based on the k-means algorithm. In the first pass, k-means is applied to the texts, with the value of k chosen from a given range so as to maximize the average silhouette coefficient, and with the initial centers selected by a method based on sample density. Experiments show that these two methods for determining the initial parameters are feasible. In the second pass, any cluster from the first pass that contains far more samples than the other clusters is clustered again.

Finally, the thesis designs a text clustering system and tests the effect of two-pass clustering on Chinese text. The test results show that, as an experimental system, its main performance indicators are basically satisfactory.
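The selection of k by the average silhouette coefficient can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`kmeans`, `mean_silhouette`, `choose_k`) are hypothetical, toy 2-D points stand in for TF-IDF document vectors, Euclidean distance is assumed, and a deterministic farthest-first initialization is swapped in for the sample-density method, whose details the abstract does not give.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first(points, k):
    """Stand-in initializer (NOT the thesis's density-based method):
    start from the first point, then repeatedly add the point that is
    farthest from its nearest already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=50):
    """Plain k-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def mean_silhouette(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all points, where a is the
    mean distance to the point's own cluster and b is the smallest mean
    distance to any other cluster. Singletons contribute s(i) = 0."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(p, points[j]) for j in idx) / len(idx)
                for l, idx in clusters.items() if l != labels[i])
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)

def choose_k(points, k_range):
    """Run k-means for each candidate k and keep the k with the highest
    average silhouette coefficient."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in k_range}
    return max(scores, key=scores.get), scores

# Toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
best_k, scores = choose_k(points, range(2, 5))
print(best_k, scores)
```

On real documents the same loop would run over TF-IDF vectors, and the second pass described in the abstract would call `choose_k` again on any cluster that is much larger than the others.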

Keywords/Search Tags:

Text Clustering, Maximum Match, K-Means, Silhouette Coefficient

Related items

1	The Research Of Clustering Analysis's Application In Web Text Mining
2	The Research And Application Of Text Clustering Based On Improved K-means Algorithm
3	Text Clustering Based On K-means Algorithm And Realization
4	Improvement Of K-Means Algorithm And Its Application In Weibo Topic Discovery
5	Research And Implementation Of Text Clustering Based On Fuzzy C-Means Clustering Algorithm
6	Design And Implementation Of Distributed Text Clustering System Based On K-means
7	Research On Text Clustering Algorithm Based On K - Means
8	Based On The Text Of The K-means Clustering Analysis
9	Based On K-means The Chinese Text Clustering Algorithm
10	Research On Text Clustering And Its Application In Topic Detection Analysis