Study On The Chinese Text Clustering Algorithm Based On Semantic Similarity

Posted on:2019-06-15

Degree:Master

Type:Thesis

Country:China

Candidate:K P Yang

Full Text:PDF

GTID:2348330563454163

Subject:Applied Mathematics

Abstract/Summary:

With the rapid development of the Internet in China, a large amount of Chinese text has appeared online. Faced with this volume of text, finding the needed information quickly is an urgent problem. Text clustering, a form of cluster analysis, can help uncover patterns in large collections of text. This thesis addresses two problems in Chinese text clustering.

First, we study the incomplete description of semantic similarity between Chinese texts. Traditional edit distance similarity and cosine similarity ignore the fact that Chinese texts contain many synonyms and near-synonyms. Because these measures do not fully account for semantics, the resulting text similarity is not accurate enough. Based on word similarity and text length, we construct a text similarity algorithm that describes the semantic similarity of sentences containing synonyms and near-synonyms more accurately. Word similarity in this algorithm is based on word2vec: a Chinese text corpus is used to train the model, a word vector is obtained for each word in the corpus, and the similarity between two words is computed from their vectors. Experiments show that the improved text similarity algorithm performs better than cosine similarity and edit distance similarity.

Second, we improve the K-means clustering algorithm to overcome two shortcomings: the number of clusters is chosen subjectively, and the initial cluster centers are selected randomly. Although K-means is widely used for Chinese text clustering, its accuracy and stability are currently poor. We therefore adopt a maximum spacing method to select the initial cluster centers and adjust the number of clusters dynamically. This addresses both shortcomings and improves the stability and practicality of the algorithm. Experiments show that the improved K-means algorithm identifies the number of cluster categories better, and that clustering accuracy also improves compared with traditional LSI and LDA.
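
The abstract does not give the exact formulas, so the following Python sketch is only an illustration of the two ideas under stated assumptions, not the thesis's implementation. The function names (`cosine`, `text_similarity`, `farthest_first_centers`), the best-match averaging rule, the length normalisation, and the choice of the first point as the starting center are all hypothetical. The toy `vectors` dictionary stands in for word vectors that would normally come from a trained word2vec model, and the dynamic adjustment of the number of clusters is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_similarity(words_a, words_b, vectors):
    """Assumed sentence similarity built from word similarity and text length.

    Each word is matched to its most similar word in the other text; the
    best-match scores from both directions are summed and divided by the
    combined number of known words, so longer texts are not favoured.
    """
    a = [w for w in words_a if w in vectors]
    b = [w for w in words_b if w in vectors]
    if not a or not b:
        return 0.0
    best_a = sum(max(cosine(vectors[x], vectors[y]) for y in b) for x in a)
    best_b = sum(max(cosine(vectors[y], vectors[x]) for x in a) for y in b)
    return (best_a + best_b) / (len(a) + len(b))


def euclidean(u, v):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def farthest_first_centers(points, k):
    """Assumed 'maximum spacing' initialisation (farthest-first traversal).

    Starts from the first point, then repeatedly adds the point whose
    distance to its nearest already-chosen center is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(nxt)
    return centers


# Toy word vectors (hypothetical values, not trained embeddings).
vectors = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "banana": [0.0, 1.0],
}

# Toy 2-D points standing in for document vectors.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 9)]
initial_centers = farthest_first_centers(points, 3)
```

With these toy vectors, `text_similarity(["car"], ["automobile"], vectors)` scores higher than `text_similarity(["car"], ["banana"], vectors)`, which is the behaviour a purely surface-based measure such as edit distance cannot provide for synonyms. The deterministic initialisation removes the run-to-run variation caused by random starting centers, at the cost of being sensitive to outliers.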

Keywords/Search Tags:

Chinese text clustering, semantic similarity, clustering algorithm, word vector, K-means

Related items

1	Study On Similarity-based Text Clustering Algorithm And Its Application
2	Research On Text Clustering Algorithm Based On Word Frequency And Semantic
3	Search Of Group Intelligent Text Clustering Methods Based On Semantic Similarity
4	Research And Implementation Of Chinese Text Clustering Algorithms
5	Study Of Chinese Text Clustering On Improved K-means Algorithm
6	Research On Chinese Spam Filtering Based On Semantic Body And Text Clustering
7	Clustering Algorithm Research Of Short Text Based On Semantic Similarity
8	Research On The Key Techniques Of Chinese Text Clustering
9	Research Of Feature Vector Value Weighted Based On Semantic Analysis In Chinese Text Clustering
10	A Chinese Text Clustering Without Dictionary Based On The Improved Fuzzy C-Means Algorithm