Study On Similarity-based Text Clustering Algorithm And Its Application

Posted on:2010-12-26

Degree:Master

Type:Thesis

Country:China

Candidate:L P Ceng

GTID:2178360275951087

Subject:Computer application technology

Abstract/Summary:

Text clustering is an important branch of text mining and has been studied in increasing depth because of its distinctive role in knowledge discovery. Many efficient text clustering algorithms are now widely used in automatic document organization, the organization of search results, and digital library services. However, as document sets grow, traditional text clustering algorithms run into difficulties that are hard to overcome: for instance, they ignore the semantic correlation between words, and their results are unstable. This thesis studies text clustering with these problems in mind. First, we introduce, compare, and analyze the traditional text clustering algorithms. Second, because the vector space model ignores the semantic correlation between words, we propose a text clustering algorithm based on word similarity (TCWS). Third, because the traditional K-Means algorithm produces unstable clustering results, we propose a K-Means algorithm based on the average similarity of texts (KAAST). Finally, the research results are applied to a public security information system.

The work in this thesis is as follows:

(1) We introduce the traditional text clustering algorithms and compare and analyze them in terms of scalability, multidimensionality, the handling of high-dimensional data, and other criteria.

(2) We propose a text clustering algorithm based on word similarity (TCWS). First, TCWS uses word similarity to group words into classes, which captures the semantic relevance between words. It then uses these word classes as the feature terms of the vector space model when representing texts, which reduces the dimensionality of the model. Finally, it applies a partitioning clustering algorithm. Experiments showed that TCWS improves the accuracy of the clustering results.

(3) We propose a K-Means algorithm based on the average similarity of texts (KAAST). First, the average similarity of each text in the collection is computed. Second, the text with the greatest average similarity is selected from the collection as an initial cluster center, and at the same time the texts associated with that center's cluster are deleted from the collection. The initial cluster centers selected in this way are both representative and dispersed. Finally, the selected centers are used as the initial cluster centers of the K-Means algorithm. Experiments showed that KAAST improves the stability of the clustering results.

(4) Based on the research above, the algorithms presented in this thesis are applied to the public security information system, and a text clustering system is designed and implemented that improves efficiency and accuracy.
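
The KAAST seeding step described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not specify the similarity measure, how texts "associated with" a chosen center are identified, or whether average similarities are recomputed after each deletion. The sketch assumes cosine similarity over raw term counts, a hypothetical `threshold` parameter for deciding which texts to delete, and recomputation over the remaining texts; the function names `cosine` and `select_initial_centers` are likewise illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_initial_centers(docs, k, threshold=0.5):
    """Pick up to k seed documents (by index) for K-Means.

    Assumptions (not stated in the abstract): whitespace tokenization,
    cosine similarity, and a fixed similarity threshold that decides
    which texts are deleted along with each chosen center.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    remaining = set(range(len(docs)))
    centers = []

    while remaining and len(centers) < k:
        def avg_sim(i):
            # Average similarity of text i to the other remaining texts.
            others = [j for j in remaining if j != i]
            if not others:
                return 0.0
            return sum(cosine(vectors[i], vectors[j]) for j in others) / len(others)

        # Text with the greatest average similarity becomes a center;
        # sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=avg_sim)
        centers.append(best)

        # Delete the chosen text and the texts similar to it, so the
        # next center is drawn from a different region of the collection.
        remaining -= {
            j for j in remaining
            if j == best or cosine(vectors[best], vectors[j]) >= threshold
        }

    return centers
```

The returned indices would then serve as the initial cluster centers for an ordinary K-Means run in place of random initialization. Because the selection is deterministic for a given collection, repeated runs start from the same centers, which is the kind of stability the abstract attributes to KAAST.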

Keywords/Search Tags:

Text Clustering, Word Similarity, K-Means, Vector Space Model, Public Security Information

Related items

1	Research And Implementation Of Chinese Text Clustering Algorithms
2	Text Classification Based On Word Vector And Topic Vector
3	Research On Text Similarity Algorithm Based On VSM Combined With Word Semantics
4	Research And Implementation Of Text Mining Technology Based On Public Security Information
5	The Design And Implementation Of Automatic Categorization System Of Public Security Information Based On SVM
6	Text Similarity Computing Theory And Applied Research
7	Research On English Text Clustering Method Based On Vector Space
8	Design And Implementation Of The Character Classification System Used In Search Engine
9	Study On The Chinese Text Clustering Algorithm Based On Semantic Similarity
10	Study Of Chinese Text Clustering On Improved K-means Algorithm