Research On Deep Clustering Algorithm For Public Opinion Text

Posted on: 2022-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Wang
Full Text: PDF
GTID: 2518306572959919
Subject: Computer technology
Abstract/Summary:
Public opinion is a concentrated reflection of the public will. Internet public opinion comes from a wide range of sources, spreads quickly, and is highly diverse. To help government agencies and social media platforms track how public opinion develops across massive volumes of public opinion text, the texts must be accurately partitioned according to the information they contain. Compared with supervised methods, unsupervised clustering algorithms do not require large amounts of labeled data and can achieve relatively good performance at low labor cost, which makes them an effective way to partition public opinion texts. In recent years, deep clustering algorithms that integrate representation learning with clustering objectives have received widespread attention and achieved outstanding performance. Existing deep clustering algorithms focus on building a general framework that improves results across multiple clustering subtasks. Although these methods perform well on each subtask, their clustering performance on specific subtasks still leaves room for improvement. This thesis therefore focuses on the text clustering subtask and, for static public opinion text data, studies improved deep clustering methods that partition texts by case. Because few datasets exist for public opinion text clustering, this thesis crawls and cleans public opinion documents and constructs a public opinion text dataset partitioned by case.

This thesis identifies the core challenges of applying deep clustering to public opinion text as the sparsity of short texts and the fusion of text information with deep clustering algorithms. It therefore proposes a bottleneck analysis method for deep clustering based on representation sparsity and entity replacement. On the one hand, to verify the impact of short-text sparsity on deep clustering, the feature representations of the Reuters news dataset are artificially "sparsified"; the sparse representations significantly reduce the performance of existing deep clustering methods. On the other hand, public opinion information is an important component of the text information in this study, and a key task is to explore the form this information takes in the text. This thesis hypothesizes that entities carry part of the public opinion information and verifies the hypothesis: randomly replacing the entity phrases in public opinion texts with an equal number of non-entity words clearly reduces the performance of existing deep clustering algorithms on the public opinion text dataset.

This thesis proposes a feature enhancement method for short public opinion texts based on entity-set similarity. The goal is to use the public opinion information contained in entities to strengthen the feature representation of public opinion texts and alleviate the sparsity of short texts. The method rests on two key ideas: self-replication and entity-based similarity measurement. Self-replication operates at the sentence level: a sentence in the document is randomly selected, copied, and appended to the document. This addresses the uneven distribution of word counts across the dataset, so that documents with different labels are better separated under Euclidean distance, while long and short documents with the same label lie closer together. The sentence-level similarity measurement based on entity sets builds on the finding that entities carry part of the public opinion information; it uses entities to perform feature selection, find information that benefits clustering, and strengthen the document representation. Experiments show that this feature enhancement method effectively improves the performance of multiple deep clustering methods.

This thesis also proposes a latent topic self-supervised method based on a topic model. It aims to fuse document-topic information into the representation learning of autoencoder-based deep clustering, guide both pre-training and self-supervised training, reduce the inconsistency between the two stages, and obtain a cluster-friendly representation. The method has four key components: an autoencoder; document-topic distribution acquisition; document-topic-guided autoencoder pre-training; and self-supervised training with dual auxiliary target distributions. The autoencoder is a common component of deep clustering networks; it uses an unsupervised reconstruction task to capture a hidden-layer representation of the input features. Document-topic distribution acquisition uses non-negative matrix factorization to obtain a document-topic distribution whose number of topics equals the number of clusters. Document-topic-guided pre-training combines the reconstruction loss with the KL divergence loss between the document-topic distribution and the output of a multi-class classification layer designed in this thesis, so that the two losses jointly guide the autoencoder's representation learning. Self-supervised training with dual auxiliary target distributions computes its loss from two target distributions: the proposed target distribution G, which is calculated from the captured document-topic information, and the target distribution P commonly used in deep clustering. This loss guides the self-supervised training. Experiments on the public opinion text dataset show that the latent topic self-supervised method effectively integrates document-topic information into the representation learning of autoencoder-based deep clustering, yielding a cluster-friendly representation and improving clustering performance. Generalization experiments on the THUCNews and Stackoverflow datasets indicate that the method generalizes to some extent to general long-text and short-text datasets.

Finally, this thesis discusses future work in this research direction. Potential work includes studying a network and loss equivalent to the entity-set-similarity-based feature enhancement method for short public opinion texts, and studying techniques similar to the topic-model-based latent topic self-supervised method that can be applied to general clustering tasks or to other task-specific clustering tasks.
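The abstract does not give formulas for the losses used by the latent topic self-supervised method, so the following is only an illustrative sketch. It assumes that the "common target distribution P" is the one popularized by Deep Embedded Clustering (DEC), where a Student's t soft assignment Q is sharpened into P and the clustering loss is KL(P || Q). The toy embeddings, the centroids, the `doc_topic` matrix, and the `classifier_output` values are all hypothetical, and the thesis's own target distribution G is not reproduced because the abstract does not define it.

```python
import math

def soft_assignment(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment Q; each row sums to 1 (DEC-style assumption)."""
    q = []
    for z in embeddings:
        row = []
        for mu in centroids:
            dist_sq = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q

def target_distribution(q):
    """Sharpened target P: square Q, divide by soft cluster frequency, renormalize."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(target, predicted, eps=1e-12):
    """Mean over rows of KL(target || predicted)."""
    total = 0.0
    for t_row, p_row in zip(target, predicted):
        total += sum(
            t * math.log((t + eps) / (p + eps))
            for t, p in zip(t_row, p_row)
            if t > 0
        )
    return total / len(target)

# Hypothetical 2-D hidden representations and two cluster centroids.
embeddings = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
centroids = [[0.0, 0.0], [1.0, 1.0]]

q = soft_assignment(embeddings, centroids)
p = target_distribution(q)
clustering_loss = kl_divergence(p, q)  # self-supervised term KL(P || Q)

# Hypothetical document-topic distribution (for example, row-normalized NMF
# output with as many topics as clusters) and a hypothetical softmax output
# of the classification layer used during pre-training.
doc_topic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
classifier_output = [[0.7, 0.3], [0.6, 0.4], [0.3, 0.7], [0.4, 0.6]]
topic_guidance_loss = kl_divergence(doc_topic, classifier_output)

print(round(clustering_loss, 4), round(topic_guidance_loss, 4))
```

In a DEC-style training loop, P would be recomputed periodically from the current Q and the KL term minimized with respect to the encoder and the centroids. How the thesis weights the reconstruction loss against the topic-guidance term, and how G and P are combined, is not stated in the abstract and is left open here.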
Keywords/Search Tags: clustering, deep clustering, text clustering, public opinion text