
Research And Improvement Of Text Clustering Based On GloVe

Posted on: 2020-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: L Xu
Full Text: PDF
GTID: 2428330590961165
Subject: Engineering
Abstract/Summary:
With the development of information technology, the amount of electronic information on the Internet has increased dramatically. Quickly and accurately selecting the required information from massive data has become a major challenge, and text clustering is an effective way to address it. Representing text accurately is an important step in the text clustering process. The vector space model is widely used for text representation, but the traditional vector space model uses feature words as the dimensions of the text vector, so the vectors are high-dimensional and cannot represent the semantic information of the text. Researchers have therefore proposed constructing text vectors from word vectors. After analyzing the common methods of constructing text vectors from word vectors, we find that they all have shortcomings. This paper therefore proposes a new text representation method, clustering and weighted word vectors based on the Jaccard similarity coefficient (JSC-CW). The method draws on the ideas of TF-IDF weighting and clustering: it uses information about the influence of words on the text and makes the dimensions of the text vector interpretable, thereby improving the accuracy of the text vector, which is then applied to text clustering.

In recent years, researchers have proposed a variety of word vector models based on different ideas. Among them, Word2vec word vectors are widely used in natural language processing, but Word2vec trains the model only on the words in a local context window of the text and does not use the statistical information of the entire corpus. The GloVe word vector model, by contrast, builds on the idea of Word2vec and makes full use of global statistical information to obtain word vectors. This paper therefore studies text clustering based on the GloVe word vector model.

The affinity propagation algorithm is a new clustering algorithm. In some cases its clustering results are better than those of traditional clustering algorithms, and it does not require the initial cluster centers or the number of clusters to be specified. However, the algorithm requires the preference parameter to be initialized, and the result of the previous iteration is not taken into account when the availability is updated in each iteration. This paper therefore proposes an adaptive competitive affinity propagation (AC-AP) algorithm to solve these problems. For large-scale datasets, however, the execution time of AC-AP is too long, and it tends to generate more clusters than the true number. This paper therefore proposes hierarchical adaptive competitive affinity propagation (HAC-AP), an algorithm based on AC-AP that improves the clustering effect and reduces the execution time on large-scale datasets.

The experimental results show that text vectors constructed with GloVe perform better in text clustering than those constructed with Word2vec, and that text clustering based on the JSC-CW text representation outperforms the traditional methods. The HAC-AP algorithm proposed in this paper gives better results than the traditional clustering algorithms and affinity propagation, although on small datasets its clustering effect is slightly worse than that of affinity propagation. In terms of execution time, on large datasets HAC-AP is much faster than the affinity propagation algorithm but slower than some traditional clustering algorithms such as K-means and BIRCH; on small datasets its execution time is longer than that of the other clustering algorithms.
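
The abstract refers to building text vectors from word vectors with TF-IDF weighting. As a rough illustration of that general idea only (not the JSC-CW method, whose details the abstract does not give), the following minimal Python sketch averages word vectors with TF-IDF weights. The function names, the smoothed IDF formula, and the two-dimensional toy embeddings standing in for trained GloVe vectors are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_weighted_text_vectors(docs, word_vectors):
    """Return one vector per document: the TF-IDF-weighted mean of its word vectors.

    docs is a list of token lists; word_vectors maps a word to a list of floats.
    Both the interface and the weighting details are assumptions for illustration.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    dim = len(next(iter(word_vectors.values())))
    text_vectors = []
    for doc in docs:
        # Term counts, restricted to words that have an embedding.
        tf = Counter(word for word in doc if word in word_vectors)
        total = sum(tf.values())
        vec = [0.0] * dim
        weight_sum = 0.0
        for word, count in tf.items():
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            idf = math.log((1 + n_docs) / (1 + df[word])) + 1.0
            weight = (count / total) * idf
            weight_sum += weight
            for i, x in enumerate(word_vectors[word]):
                vec[i] += weight * x
        if weight_sum > 0:
            vec = [x / weight_sum for x in vec]
        text_vectors.append(vec)
    return text_vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy embeddings (hypothetical values, not real GloVe output).
word_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "stock": [0.0, 1.0],
    "market": [0.1, 0.9],
    "the": [0.5, 0.5],
}
docs = [
    ["the", "cat", "dog"],
    ["the", "dog", "cat", "cat"],
    ["the", "stock", "market"],
]
vectors = tfidf_weighted_text_vectors(docs, word_vectors)
```

With these toy inputs, the two animal-related documents end up closer to each other under `cosine` than either is to the finance-related one. Vectors of this kind could then be passed to any clustering algorithm, including affinity propagation.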
Keywords/Search Tags: text clustering, text vector, word vector, affinity propagation