Text Mining Based On Clustering Algorithm

Posted on:2021-04-28

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zou

Full Text:PDF

GTID:2428330626955919

Subject:Communication and Information System

Abstract/Summary:

With the rapid development of mobile Internet technology, network data interaction has become increasingly frequent, and the volume of interactive data is growing exponentially. Text is the main form of this data, and the text most often encountered in daily life is short text. Against this background, uncovering the correlations hidden in massive short text data is of great significance for text organization, text classification, and the development of text-based recommendation systems. Because clustering can discover potential correlations between data items and form corresponding cluster structures, short text clustering is the focus of this work.

Short texts contain few terms, which makes feature extraction difficult. When a traditional text clustering model is applied to short texts, it often fails to produce an effective cluster structure, which harms subsequent applied research. The Word2Vec word vector model uses context information to map each center word to a vector in a word vector space. Compared with the traditional vector space model, it incorporates the influence of the semantic context during training, which gives it a certain advantage. Word2Vec assumes that words appearing in similar contexts should have similar semantics, so semantically similar words are mapped to vectors that lie closer together. The LDA topic model, which builds on the LSA and PLSA models, extracts document-topic and topic-term information from document-term information. The topic words reflect the underlying information of the text to some extent, which helps short text clustering.

To address the unsuitability of traditional text clustering models for short texts, this thesis proposes an improved text representation method based on the Word2Vec word vector model and the LDA topic model, and an improved k-means clustering algorithm based on the LDA topic model. Simulation results show that, on the headline data set, the improved text representation method clusters better than both the representation that averages Word2Vec word vectors and the representation that combines Word2Vec word vectors with TF-IDF weights. They also show that, on the toutiaoqiao news data set, the improved k-means algorithm clusters better than the unimproved k-means algorithm and the k-means++ clustering algorithm.
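
The two baseline text representations named in the abstract (a plain average of word vectors and a TF-IDF-weighted average), followed by ordinary k-means, might be sketched as below. This is an illustration only and not the thesis's improved method, whose details the abstract does not give. The toy 2-D vectors in `WORD_VECS`, the sample documents, and every function name are assumptions made for the example; the smoothed IDF formula is one common choice among several, and the initial centroids are passed in by hand rather than chosen by k-means++ seeding.

```python
import math
from collections import Counter

# Hypothetical 2-D vectors standing in for trained Word2Vec embeddings.
WORD_VECS = {
    "stock": [1.0, 0.0],
    "market": [0.9, 0.1],
    "match": [0.0, 1.0],
    "team": [0.1, 0.9],
    "today": [0.5, 0.5],
}
DIM = 2


def mean_vector(tokens):
    """Baseline 1: plain average of the vectors of known tokens."""
    vecs = [WORD_VECS[t] for t in tokens if t in WORD_VECS]
    if not vecs:
        return [0.0] * DIM
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]


def idf_table(docs):
    """Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_weighted_vector(tokens, idf):
    """Baseline 2: average of word vectors weighted by TF * IDF."""
    tf = Counter(t for t in tokens if t in WORD_VECS)
    weights = {t: tf[t] * idf.get(t, 1.0) for t in tf}
    total = sum(weights.values())
    if total == 0:
        return [0.0] * DIM
    return [sum(w * WORD_VECS[t][i] for t, w in weights.items()) / total
            for i in range(DIM)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iters=10):
    """Lloyd's k-means from explicit initial centroids; returns labels."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(len(centroids)),
                      key=lambda k: sq_dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Toy tokenized short texts (assumed data, two finance and two sports items).
docs = [
    ["stock", "market", "today"],
    ["market", "stock", "today"],
    ["team", "match", "today"],
    ["match", "team"],
]
idf = idf_table(docs)
plain = [mean_vector(d) for d in docs]
weighted = [tfidf_weighted_vector(d, idf) for d in docs]
labels = kmeans(weighted, centroids=[weighted[0], weighted[2]])
print(labels)
```

In this toy data the word "today" occurs in three of the four documents, so its IDF weight is lower and the weighted representation of the first document moves slightly toward the finance words compared with the plain average.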

Keywords/Search Tags:

short text clustering, word2vec, LDA topic model, k-means algorithm, k-means++ algorithm

Related items

1	Research And Its Application Of Web Short Text Clustering Method Based On K-Means Algorithm
2	Study Of Chinese Text Clustering On Improved K-means Algorithm
3	Text Clustering Based On K-means Algorithm And Realization
4	The Research And Application Of Text Clustering Based On Improved K-means Algorithm
5	Research And Implementation Of Text Clustering Based On Fuzzy C-Means Clustering Algorithm
6	Based On K-means The Chinese Text Clustering Algorithm
7	Design And Implementation Of Text Information Recommendation System Based On Short Text Processing Algorithm Optimization
8	Research And Implementation Of K-means++ Algorithm Improvement And Search Application Based On Latent Semantics
9	Fuzzy C-means And K-means Clustering Algorithm And Its Parallel
10	Analysis Of Network Public Opinion Data Based On Short Text Clustering