
Research On Text Clustering Algorithm Based On Latent Semantic Indexing

Posted on: 2009-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: L L Wang
GTID: 2178360245988741
Subject: Computer application technology
Abstract/Summary:
With the explosive growth of text information on the Internet and the maturing of search engine technology, the main problem people face is no longer a shortage of information but how to access it more efficiently. Because it is flexible and can be automated, text clustering has become an indispensable means of organizing and navigating massive amounts of text.

Building on a thorough study of the whole text clustering process, and exploiting the strengths of LSI (Latent semantic indexing) in semantic analysis and dimension reduction, this thesis takes the popular K-means text clustering algorithm as its framework and explores the application of LSI to text clustering. The aim is to obtain a more efficient text clustering algorithm and to cluster texts better at the level of word sense. The main work of the thesis is as follows.

Firstly, text preprocessing is the most important foundation of text clustering, and many of its key techniques directly determine the final clustering result. The thesis discusses several of these key problems in depth and systematically, including keyword extraction and text vectorization, which lays a solid foundation for the clustering experiments that follow.

Secondly, to address the unstable results of the K-means algorithm, the author improves the algorithm. The main improvements are: (1) choosing the parameter k automatically by applying the minimum-maximum principle; (2) using the cosine similarity of the vectors, rather than their Euclidean distance, as the measure of similarity between texts; and (3) using iterative convergence conditions to obtain stable results even when the initial points are selected randomly.

Thirdly, as an important application of natural language processing, text clustering is characterized by high dimensionality and by its dependence on semantics. Consequently, besides the choice of clustering algorithm, the factors that affect clustering results also include semantic processing and dimension reduction. How to reduce the dimension of the vector space effectively has become both the focus and the difficulty of text clustering. Applying LSI to text clustering, the thesis uses SVD (Singular value decomposition) and SDD (Semi-discrete decomposition) in turn to decompose the vector file produced by text preprocessing, then clusters the vector space after the noise has been eliminated, and thereby verifies the validity of the approach.

Finally, the author carries out a number of clustering experiments on several preprocessed corpora and analyzes the results. The experimental results demonstrate the effectiveness of the improved clustering algorithm. They also show that the LSI-based text clustering algorithm goes beyond existing word-level rules and comes closer to natural language understanding by combining word rules with statistics.
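The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the thesis's implementation. The function names, the toy term-document matrix, the choice of rank 2, and the similarity threshold of 0.5 are all assumptions made for the example. In particular, the abstract does not say how the minimum-maximum principle determines k; the sketch assumes a farthest-first rule that keeps adding seed documents until every document is sufficiently similar to some seed. Only the SVD variant of LSI is shown.

```python
import numpy as np

def lsi_project(term_doc, rank):
    """Map documents into a rank-dimensional latent space via truncated SVD."""
    # term_doc has shape (n_terms, n_docs); each column is one document.
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the largest `rank` singular values; rows of the result are documents.
    return (s[:rank, None] * vt[:rank]).T

def normalize_rows(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0.0, 1.0, norms)

def maxmin_seeds(x, threshold=0.5):
    """Pick seeds farthest-first (assumed reading of the min-max principle).

    x must hold unit-length, non-zero rows. `threshold` is a hypothetical
    parameter: seeding stops once every document has cosine similarity of
    at least `threshold` to its closest seed, which also fixes k.
    """
    seeds = [0]
    while len(seeds) < len(x):
        nearest = (x @ x[seeds].T).max(axis=1)  # similarity to closest seed
        candidate = int(np.argmin(nearest))     # document least like any seed
        if nearest[candidate] >= threshold:
            break
        seeds.append(candidate)
    return x[seeds].copy()

def cosine_kmeans(x, centers, max_iter=100):
    """K-means on unit vectors, assigning by cosine similarity."""
    labels = np.full(len(x), -1)
    for _ in range(max_iter):
        new_labels = np.argmax(x @ centers.T, axis=1)
        if np.array_equal(new_labels, labels):  # converged: assignments stable
            break
        labels = new_labels
        for j in range(len(centers)):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        centers = normalize_rows(centers)
    return labels

# Toy corpus: documents 0-1 share terms 0-1, documents 2-3 share terms 2-3.
term_doc = np.array([[2, 1, 0, 0],
                     [1, 2, 0, 0],
                     [0, 0, 2, 1],
                     [0, 0, 1, 2]], dtype=float)
docs = normalize_rows(lsi_project(term_doc, rank=2))
labels = cosine_kmeans(docs, maxmin_seeds(docs))
```

On this toy matrix the two document groups are orthogonal in the latent space, so the seeding step stops at two seeds and each group ends up in its own cluster. Real corpora would need weighted term vectors and a rank and threshold chosen empirically.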
Keywords/Search Tags: Text clustering, LSI (Latent semantic indexing), SVD (Singular value decomposition), SDD (Semi-discrete decomposition), K-means algorithm