
A Study On The Clustering Of Chinese Web Text

Posted on: 2010-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: W M Yu
GTID: 2178330338975820
Subject: Computer software and theory
Abstract/Summary:
With the growing popularity of the Internet, Web text information is expanding explosively. Indexing, retrieving, managing, and mining massive Web text data has become a major challenge for computer science. Text clustering technology offers an effective approach to the classified management and visualization of massive text collections. As an unsupervised machine learning method, text clustering has been widely used in Web applications such as information retrieval and automatic multi-document summarization.

On the basis of a review of existing research at home and abroad, this thesis studies Chinese Web text clustering in two typical application scenarios: (1) clustering massive numbers of Chinese texts on news portals, and (2) clustering the results returned by Chinese Web search engines in real time.

In the first scenario, we design a series of distributed text clustering algorithms on MapReduce, a distributed parallel computing framework. In the text pre-processing stage, we design a new iterative algorithm that computes tf-idf weights on MapReduce in order to evaluate how important a word is to a text in a corpus. In the text clustering stage, we first divide the corpus into overlapping subsets called "canopies" using an approximate distance measure. Building on that step, we design a new distributed K-means text clustering algorithm on MapReduce that uses a rigorous, and thus more expensive, distance metric. Finally, we implement a distributed Chinese text clustering system on MapReduce based on the improved algorithms.
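The canopy step of the first scenario can be illustrated with a minimal single-machine sketch in Python. This is not the thesis's MapReduce implementation: the function names `cheap_distance` and `canopies`, the thresholds `t1` (loose) and `t2` (tight), the example documents, and the choice of Jaccard distance as the approximate measure are all assumptions made for illustration.

```python
def cheap_distance(a, b):
    """Jaccard distance on term sets; cheap because it ignores term weights.

    Assumed stand-in for the "approximate distance measure" in the abstract.
    """
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    if not union:
        return 1.0
    return 1.0 - len(terms_a & terms_b) / len(union)


def canopies(vectors, t1, t2, distance=cheap_distance):
    """Split documents into overlapping canopies.

    vectors: list of sparse term-weight dicts (e.g. tf-idf weights).
    t1: loose threshold; points within t1 of a center join its canopy.
    t2: tight threshold; points within t2 can no longer start or join
        later canopies. Requires t1 >= t2.
    Returns a list of (center_index, member_indices) pairs.
    """
    if t1 < t2:
        raise ValueError("t1 (loose) must be >= t2 (tight)")
    remaining = list(range(len(vectors)))
    result = []
    while remaining:
        center = remaining[0]
        members, still_remaining = [], []
        for i in remaining:
            # The center is always at distance 0 from itself.
            d = 0.0 if i == center else distance(vectors[center], vectors[i])
            if d <= t1:
                members.append(i)
            if d > t2:
                still_remaining.append(i)
        result.append((center, members))
        remaining = still_remaining
    return result


# Hypothetical segmented documents with placeholder weights.
example_docs = [
    {"stock": 1.0, "market": 1.0},
    {"stock": 1.0, "market": 1.0, "rise": 1.0},
    {"football": 1.0, "match": 1.0},
    {"football": 1.0, "league": 1.0, "match": 1.0},
    {"stock": 1.0, "match": 1.0},
]
```

Calling `canopies(example_docs, t1=0.8, t2=0.5)` places the last document in two canopies, which is the overlap the abstract refers to. A more expensive metric, such as cosine distance on the weighted vectors, would then only need to be evaluated between points that share a canopy during K-means.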
The system runs large-scale text clustering tasks efficiently and stably. Experimental results on a real Chinese corpus show that our algorithms cluster large text collections efficiently, that the running time grows approximately linearly with corpus size within a certain range, and that the clustering quality is satisfactory.

In the second scenario, we combine weight calculation from the Vector Space Model (VSM) with a suffix-tree-based method for clustering Chinese snippets. First, in the text pre-processing stage, we apply Chinese word segmentation to each sentence of the snippets and keep the meaningful words (in Chinese, usually nouns and verbs). After constructing the Chinese suffix tree with a linear-time algorithm, we ignore nodes (feature phrases) whose document frequency (df) is too high and choose as text features the phrases that score highly under a formula we propose. We then redefine the pairwise text similarity measure for Chinese snippets in terms of these features. Combining the suffix-tree-based text features with the new pairwise similarity measure, the agglomerative hierarchical clustering (AHC) algorithm runs in almost real time. Experimental results show that the new method improves clustering quality and that its speed meets the demands of "on the fly" applications.

Beyond the reference and practical value of the experience and findings from our experiments, our work provides an example that may be useful for text clustering problems in specific scenarios.
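The snippet-clustering pipeline of the second scenario can be sketched in Python under several stated assumptions. Word n-grams are enumerated naively here as a stand-in for suffix tree nodes (a real suffix tree would be built in linear time), the thesis's phrase-scoring formula is not given in the abstract and is therefore omitted, and Jaccard overlap of shared phrases is an assumed placeholder for the redefined similarity measure. All function names and parameters (`shared_phrases`, `max_df_ratio`, `ahc`, `threshold`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_phrases(snippets, max_len=3, max_df_ratio=0.6):
    """Map each phrase (tuple of words) to the ids of snippets containing it.

    Keeps phrases shared by at least two snippets and drops phrases whose
    document frequency exceeds max_df_ratio of the collection.
    """
    postings = defaultdict(set)
    for doc_id, words in enumerate(snippets):
        for n in range(1, max_len + 1):
            for start in range(len(words) - n + 1):
                postings[tuple(words[start:start + n])].add(doc_id)
    limit = max_df_ratio * len(snippets)
    return {p: ids for p, ids in postings.items() if 2 <= len(ids) <= limit}


def phrase_sets(snippets, phrases):
    """Turn the phrase postings into one feature set per snippet."""
    features = [set() for _ in snippets]
    for phrase, ids in phrases.items():
        for i in ids:
            features[i].add(phrase)
    return features


def similarity(a, b):
    """Jaccard overlap of two phrase sets (assumed placeholder measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ahc(features, threshold):
    """Average-link agglomerative clustering.

    Repeatedly merges the most similar pair of clusters and stops when the
    best average similarity falls below threshold.
    """
    clusters = [[i] for i in range(len(features))]

    def link(c1, c2):
        total = sum(similarity(features[i], features[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (x, y), best = max(
            (((x, y), link(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


# Hypothetical snippets, already segmented into words.
example_snippets = [
    ["apple", "phone", "price"],
    ["apple", "phone", "review"],
    ["apple", "pie", "recipe"],
    ["apple", "pie", "baking"],
]
```

In this example, "apple" occurs in every snippet, so the df filter discards it, and the remaining shared phrases separate the phone snippets from the pie snippets. This naive merge loop recomputes all pairwise links on every pass, so it only suits the small result lists a search engine returns per query.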
Keywords/Search Tags: text clustering, K-means, suffix tree, MapReduce