Study on Similarity-Based Text Clustering Algorithm and Its Application

Posted on: 2010-12-26
Degree: Master
Type: Thesis
Country: China
Candidate: L P Ceng
Full Text: PDF
GTID: 2178360275951087
Subject: Computer application technology
Abstract/Summary:
Text clustering is an important branch of text mining and has been studied in increasing depth because of its distinctive capacity for knowledge discovery. Many efficient text clustering algorithms are now widely used in automatic document organization, the organization of search results, and digital library services. However, as document sets grow, traditional text clustering algorithms encounter a number of difficulties that are hard to overcome: for instance, they ignore the semantic correlation between words, and their results are unstable. This thesis studies text clustering with these problems in mind. First, we introduce, compare, and analyze the traditional text clustering algorithms. Second, because the vector space model ignores the semantic correlation between words, we propose a text clustering algorithm based on word similarity (TCWS). Third, because the traditional K-Means algorithm produces unstable clustering results, we propose a K-Means algorithm based on the average similarity of texts (KAAST). Finally, the research results are applied to a public security information system.

The work in this thesis is as follows:

(1) The traditional text clustering algorithms are introduced, then compared and analyzed in terms of scalability, multi-dimensionality, handling of high-dimensional data, and so on.

(2) We propose a text clustering algorithm based on word similarity (TCWS). TCWS first uses word similarity to group words into classes, which captures the semantic relevance between words. It then uses these word classes, rather than individual words, as the feature items of the vector space model, which reduces the dimensionality of the model. Finally, it applies a partitioning clustering algorithm. Experiments showed that TCWS improves the accuracy of the clustering results.

(3) We propose a K-Means algorithm based on the average similarity of texts (KAAST). First, the average similarity of each text in the collection is computed. Second, the text with the greatest average similarity is selected from the collection as an initial cluster center, and the texts associated with that initial cluster center are deleted from the collection. Centers selected in this way are both representative and well dispersed. Finally, the selected texts are used as the initial cluster centers of the K-Means algorithm. Experiments showed that KAAST improves the stability of the clustering results.

(4) Building on the theoretical research above, the algorithms presented in this thesis are applied to a public security information system, and a text clustering system is designed and implemented, which improves efficiency and accuracy.
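The initial-center selection described in (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not say how similarity is measured or how a text is judged to be "associated" with a chosen center, so the sketch assumes cosine similarity over term vectors and a hypothetical `threshold` parameter. The function names are likewise invented for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_initial_centers(docs, k, threshold):
    """Return up to k indices of texts to use as initial K-Means centers.

    Assumption: a text is "associated" with a chosen center when its cosine
    similarity to that center is at least `threshold` (a hypothetical
    parameter; the abstract does not define the criterion).
    """
    n = len(docs)
    sim = [[cosine(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    remaining = set(range(n))
    centers = []

    while remaining and len(centers) < k:
        def average_similarity(i):
            # Mean similarity of text i to the other texts still in the pool.
            others = [j for j in remaining if j != i]
            return sum(sim[i][j] for j in others) / len(others) if others else 0.0

        # The text with the greatest average similarity becomes a center.
        seed = max(sorted(remaining), key=average_similarity)
        centers.append(seed)

        # Remove the center and every text associated with it, so that the
        # next center is drawn from a different region of the collection.
        remaining.discard(seed)
        remaining -= {j for j in remaining if sim[seed][j] >= threshold}

    return centers


# Toy term vectors forming two obvious groups (indices 0-2 and 3-5).
docs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
        [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
centers = select_initial_centers(docs, k=2, threshold=0.8)
print(centers)  # one index from each group
```

The returned indices would then be passed to a standard K-Means run as its starting centers in place of random initialization. Because the selection is deterministic for a given collection and threshold, repeated runs start from the same centers, which is consistent with the stability improvement the abstract reports.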
Keywords/Search Tags:Text Clustering, Word Similarity, K-Means, Vector Space Model, Public Security Information