Quality Evaluation Of Text Clustering Results And Investigation On Text Representation

Posted on: 2006-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: Z T Zhou
Full Text: PDF
GTID: 2178360185496996
Subject: Computer software and theory
Abstract/Summary:
With the continuing development and application of computing technology, the quantity of digitized text documents keeps growing, and the spread of the WWW has accelerated that growth. Using clustering analysis to simplify the representation of massive text data, reorganize search engine results, speed up information retrieval, and recommend customized information is therefore a promising application. Research on text clustering has produced many algorithms that are hard to compare and choose between, so evaluating the quality of text clustering results is important. However, the ways of evaluating a clustering result differ and are easily confused. There is no generally accepted evaluation method, and in-depth studies are lacking, so choosing a text clustering algorithm or setting its parameters has no scientific foundation. In research and application development, which evaluation indexes can be used, and which of them are better or worse, are problems that need to be solved. These problems boil down to how to evaluate the quality of the results of text clustering algorithms.

This thesis focuses on evaluating text clustering algorithms and improving the quality of text clustering. The work on clustering evaluation and text representation models includes:

(1) Analysis of the factors that affect text clustering. Three kinds of factors are introduced and discussed in detail: text representation models, measures of distance or similarity, and clustering algorithms.
(2) Introduction of two classes of evaluation indexes. The application scenarios of the two classes are distinguished. Particular attention is paid to the first class, which is based on human judgment, and the characteristics of each index in this class are clarified.
(3) Development of a software package covering Chinese and English text parsing, clustering algorithms, and evaluation of clustering results. The package implements several text representation models and several basic text clustering algorithms.
(4) Experiments evaluating the hierarchical agglomerative clustering algorithm and k-means algorithms, a comparison of three basic clustering algorithms, and others. The experiments explain the big-cluster phenomenon and show that poor parameter settings cause a large decline in clustering quality. The limitation of the text representation model currently in use is the fundamental cause of the big-cluster phenomenon.
(5) A tentative investigation of text representation models. The text representation model is the basis of text clustering. We analyzed the characteristics and limitations of the Vector...
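As an illustration of an evaluation index that compares a clustering against human-assigned categories, the sketch below computes purity. Purity is only one common index of this kind and is an assumption here: the abstract does not name the indexes the thesis studies. The function name `purity` and the toy labels are hypothetical.

```python
from collections import Counter


def purity(cluster_ids, human_labels):
    """Fraction of documents that carry the majority human label of their cluster.

    Illustrative only: purity is assumed as an example of a human-judgment
    index; the abstract does not say which indexes the thesis evaluates.
    """
    if len(cluster_ids) != len(human_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must not be empty")

    # Count human labels inside each cluster.
    per_cluster = {}
    for cluster, label in zip(cluster_ids, human_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1

    # Each cluster contributes the size of its largest label group.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


# Toy data: six documents labeled by a human as sports ("s") or finance ("f").
labels = ["s", "s", "s", "f", "f", "f"]
balanced = [0, 0, 0, 1, 1, 1]      # clusters match the human categories
big_cluster = [0, 0, 0, 0, 0, 1]   # one cluster absorbs most documents

print(purity(balanced, labels))
print(purity(big_cluster, labels))
```

In this toy case the assignment with one oversized cluster scores lower than the balanced one, which shows how such an index can expose a big-cluster result. Purity also rises as the number of clusters grows, so it is usually read alongside other indexes.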
Keywords/Search Tags: text clustering, clustering analysis, clustering evaluation, text representation