
Study On The Chinese Text Clustering Algorithm Based On Semantic Similarity

Posted on: 2019-06-15  Degree: Master  Type: Thesis
Country: China  Candidate: K P Yang  Full Text: PDF
GTID: 2348330563454163  Subject: Applied Mathematics
Abstract/Summary:
With the rapid development of the Internet in China, a large amount of Chinese text has appeared online. Faced with this volume of text, finding the needed information quickly is an urgent problem. Text clustering, as one kind of clustering method, can help uncover patterns in massive text collections. This thesis addresses two problems in Chinese text clustering.

First, we study the incomplete description of semantic similarity between Chinese texts. Traditional edit distance similarity and cosine similarity ignore the fact that Chinese texts contain many synonyms and near-synonyms. Based on word similarity and text length, we construct a text similarity algorithm that aims to overcome the insufficient accuracy that results when traditional similarity algorithms do not fully take semantics into account, and to describe the semantic similarity of sentences containing synonyms and near-synonyms more accurately. Word similarity in this algorithm is based on word2vec: a model is first trained on a Chinese text corpus, word vectors are then obtained for the words in the corpus, and word similarity is finally computed from those vectors. Experiments show that the improved text similarity algorithm performs better than cosine similarity and edit distance similarity.

Second, we improve the K-means clustering algorithm to address two shortcomings: the number of clusters is determined subjectively, and the initial cluster centers are selected randomly. Although K-means is widely used for Chinese text clustering, its accuracy and stability are currently poor. We therefore adopt a maximum spacing method to select the initial cluster centers and dynamically adjust the number of clusters, which overcomes the two shortcomings and improves the stability and practicality of the algorithm. Experiments show that the improved K-means algorithm identifies the number of cluster categories better, and that clustering accuracy also improves compared with traditional LSI and LDA.
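
The two ideas summarized above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the similarity formula (including how text length enters it) or the exact maximum spacing rule, so both functions below are assumptions. `text_similarity` uses a symmetric average of best-match word-vector cosines as a stand-in, and `max_spacing_centers` uses greedy farthest-point selection. The function names and the toy `vectors` dictionary are hypothetical; in practice the vectors would come from a word2vec model trained on a segmented Chinese corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_similarity(words_a, words_b, vectors):
    """Symmetric average of best-match word similarities.

    Assumption: a common stand-in, not the thesis's formula; the
    text-length term mentioned in the abstract is not modelled here.
    """
    def directed(src, dst):
        # For each source word, keep its most similar word in the other text.
        scores = [max(cosine(vectors[s], vectors[d]) for d in dst) for s in src]
        return sum(scores) / len(scores)

    # Ignore words that have no vector.
    a = [w for w in words_a if w in vectors]
    b = [w for w in words_b if w in vectors]
    if not a or not b:
        return 0.0
    return (directed(a, b) + directed(b, a)) / 2


def max_spacing_centers(points, k):
    """Choose k initial centers by greedy farthest-point selection.

    Assumption: one plausible reading of "maximum spacing". Start with the
    two points farthest apart, then repeatedly add the point whose distance
    to its nearest chosen center is largest.
    """
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and len(points)")
    if k == 1:
        return [points[0]]
    n = len(points)
    i, j = max(
        ((p, q) for p in range(n) for q in range(p + 1, n)),
        key=lambda pq: math.dist(points[pq[0]], points[pq[1]]),
    )
    chosen = [i, j]
    while len(chosen) < k:
        nxt = max(
            (idx for idx in range(n) if idx not in chosen),
            key=lambda idx: min(math.dist(points[idx], points[c]) for c in chosen),
        )
        chosen.append(nxt)
    return [points[c] for c in chosen]


# Toy 2-D "word vectors" (hypothetical values, for illustration only).
vectors = {
    "car": (1.0, 0.0), "auto": (0.9, 0.1),
    "fast": (0.0, 1.0), "quick": (0.1, 0.9),
}
print(text_similarity(["car", "fast"], ["auto", "quick"], vectors))
print(max_spacing_centers([(0, 0), (0, 1), (10, 0), (11, 1), (5, 8)], 3))
```

The centers returned by `max_spacing_centers` could then seed a standard K-means run in place of random initialization. Adjusting the number of clusters dynamically is a separate step that this sketch does not cover.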
Keywords/Search Tags:Chinese text clustering, semantic similarity, clustering algorithm, word vector, K-means