
Research On Hot Topic Discovery Based On Mixed Text Sets Clustering

Posted on: 2022-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: L Q Zhu
Full Text: PDF
GTID: 2518306785975859
Subject: Automation Technology
Abstract/Summary:
With the rapid development of mobile networks and terminal devices, more and more people contribute content online, producing large amounts of behavior data such as product reviews after online shopping and film reviews after watching movies. In the era of big data, analyzing and mining this user behavior data helps enterprises run targeted promotion activities and helps governments track the direction of online events and offer timely guidance. Text clustering is an important unsupervised data mining method that has been widely used in hot topic discovery, event tracking and document summarization. Although there is much related research on text clustering, it focuses on either long text or short text alone. In today's increasingly complex network environment, where long and short texts are mixed together, using traditional methods to cluster mixed text sets raises the following problems:

(1) Applying vector representations designed for long text to short text produces sparse representations. This easily causes semantic gaps between samples and inaccurate semantic representation, which reduces the accuracy of downstream clustering.
(2) Traditional methods usually extract only local feature information and miss the global picture, or, conversely, extract the text topic and ignore local features. Either way, the text is represented inaccurately.
(3) Clustering algorithms still suffer from the cluster center initialization problem, and different initializations have a significant impact on clustering accuracy.
(4) Traditional pipelines extract features from the text representation and then apply a clustering algorithm to obtain the result directly. Information flows in one direction only, so the cluster centers cannot be adjusted as the data distribution changes.

To address these problems, this thesis proposes the following methods, which alleviate their impact on the clustering of mixed text sets to a certain extent:

(1) To improve the representation of mixed text sets, a micro-level vector representation built from word embeddings and word-order embeddings is combined with a macro-level vector representation built from topic embeddings. The encoder of an autoencoder extracts features from these representations, and the decoder reconstructs them. This self-supervised training yields a feature extractor that benefits downstream text clustering.
(2) A two-stage "coarse" and "fine" clustering method is proposed. Because clustering algorithms are very sensitive to the initial cluster centers and to the number of clusters, "coarse" clustering is used to initialize "fine" clustering. In the "coarse" stage, the Canopy algorithm determines the cluster centers and the number of clusters. These results serve as the initial values for the "fine" stage, in which the samples are clustered with the K-means algorithm.
(3) Based on the clustering results, the parameters of the feature extraction module and the clustering module are adjusted in the reverse direction, so that the distribution of the clustered samples becomes more consistent with the distribution of the original samples, which improves clustering accuracy.

Results on two experimental data sets show that the proposed method addresses the poor clustering caused by weak text representation in mixed text sets. In addition, text data for seven hot topics was crawled from microblog hot topics, and the model was trained on part of this data. On the test set, the proposed clustering method performs well and, for the most part, accurately finds related topics and groups them into one class.
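The two-stage "coarse" and "fine" clustering described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the thresholds `t1` and `t2`, the use of Euclidean distance, and the toy 2-D points (standing in for learned text features) are all assumptions made for illustration. Canopy picks the number of clusters and the starting centers, and K-means then refines them.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def canopy_centers(points, t1, t2, seed=0):
    """Coarse stage: return one center per canopy (requires t1 > t2).

    Points closer than t1 to the pivot join its canopy; points closer
    than t2 are dropped from the pool so they cannot seed a new canopy.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    rng = random.Random(seed)
    remaining = list(points)
    centers = []
    while remaining:
        pivot = remaining.pop(rng.randrange(len(remaining)))
        members = [pivot] + [p for p in remaining if euclidean(p, pivot) < t1]
        centers.append(mean(members))
        remaining = [p for p in remaining if euclidean(p, pivot) >= t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Fine stage: Lloyd's K-means starting from the given centers."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points
        ]
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centers.append(mean(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]
coarse = canopy_centers(points, t1=2.0, t2=1.0)
labels, fine = kmeans(points, coarse)
```

In this sketch the number of clusters is never passed in by hand: it is simply the number of canopies found, which is the property the abstract relies on when it uses the coarse stage to initialize the fine stage.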
Keywords/Search Tags: Text clustering, Auto Encoder, Feature extraction, Neural network