
Text Mining Based On Clustering Algorithm

Posted on: 2021-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zou
Full Text: PDF
GTID: 2428330626955919
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of mobile Internet technology, data exchange over networks is becoming more and more frequent, and the volume of exchanged data is growing exponentially. Text is the main form this data takes, and the text we encounter most often in daily life is short text. Against this background, uncovering the correlations hidden in massive short text data is of great significance for organizing text data, classifying text data, and developing text-based recommendation systems. Because clustering can discover latent correlations between data items and form the corresponding cluster structure, the short text clustering problem is the focus of this work.

Because short texts contain few terms, extracting features from them is difficult. When traditional text clustering models are applied to short texts, they often fail to produce an effective cluster structure, which harms subsequent application research. The Word2Vec word vector model uses context information to map each center word to a vector in a word vector space. Compared with the traditional vector space model, it incorporates semantic context during training, which gives it a certain advantage. Word2Vec assumes that words appearing in similar contexts should have similar meanings, so semantically similar words are mapped to vectors that lie closer together. Building on the LSA and PLSA models, the LDA topic model can extract document-topic and topic-term information from document-term information. The topic words reflect the underlying information of the text data to some extent, which helps the clustering of short texts.

To address the unsuitability of traditional text clustering models for short texts, this thesis proposes an improved text representation method based on the Word2Vec word vector model and the LDA topic model, and an improved k-means clustering algorithm based on the LDA topic model. Simulation results show that, on the headline data set, the improved text representation method clusters better than both the representation that averages Word2Vec word vectors and the representation that combines Word2Vec word vectors with tf-idf weights. They also show that, on the toutiaoqiao news data set, the improved k-means algorithm clusters better than both the unimproved k-means algorithm and the k-means++ clustering algorithm.
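The two baseline text representations named in the abstract, the plain average of word vectors and the tf-idf weighted combination, can be sketched as follows. This is only an illustration under stated assumptions: the two-dimensional word vectors, the toy documents, and the function names are all hypothetical, and the smoothed idf formula is one common choice rather than the one the thesis necessarily uses. The thesis's improved LDA-based representation is not described in enough detail in the abstract to be sketched here.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional word vectors. Real Word2Vec vectors would be
# learned from a corpus and have far more dimensions.
WORD_VECTORS = {
    "stock": [1.0, 0.0],
    "market": [0.8, 0.2],
    "football": [0.0, 1.0],
    "match": [0.2, 0.8],
}

def mean_vector(tokens, vectors):
    """Plain average of the vectors of all known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def idf_table(docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_weighted_vector(tokens, vectors, idf):
    """Average of word vectors, each weighted by term frequency times idf."""
    tf = Counter(t for t in tokens if t in vectors and t in idf)
    if not tf:
        return None
    total = sum(c * idf[t] for t, c in tf.items())
    dim = len(next(iter(vectors.values())))
    return [
        sum(c * idf[t] * vectors[t][i] for t, c in tf.items()) / total
        for i in range(dim)
    ]

# Toy corpus (an assumption for illustration): "stock" occurs in every
# document, so its idf is lower than that of "match".
docs = [["stock", "market"], ["stock", "match"], ["stock", "football"]]
idf = idf_table(docs)
plain = mean_vector(docs[1], WORD_VECTORS)
weighted = tfidf_weighted_vector(docs[1], WORD_VECTORS, idf)
```

In this toy corpus the weighted vector for the second document moves toward the vector of the rarer word "match", whereas the plain average treats both words equally. For short texts with only a handful of terms, this difference in weighting can noticeably change which documents end up close together.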
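The abstract compares against the standard k-means and k-means++ algorithms. A minimal sketch of k-means with k-means++ seeding is given below, assuming documents have already been turned into numeric vectors. The function names, the toy points, and the fixed random seed are hypothetical, and this is the textbook baseline, not the thesis's LDA-based improvement.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each later center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, d in zip(points, d2):
            acc += d
            if acc > r:
                break
        # If rounding prevents the loop from breaking, p is the last point.
        centers.append(list(p))
    return centers

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm started from k-means++ seeds."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Two well-separated toy groups standing in for document vectors.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels, centers = kmeans(points, k=2)
```

Plain k-means differs only in the seeding step, where the initial centers are chosen uniformly at random. The distance-weighted seeding tends to spread the initial centers apart, which usually reduces the chance of converging to a poor cluster structure.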
Keywords/Search Tags: short text clustering, word2vec, LDA topic model, k-means algorithm, k-means++ algorithm