Research On Topic Clustering Methods For Multi-source Texts

Posted on:2022-04-21

Degree:Master

Type:Thesis

Country:China

Candidate:W J Xu

Full Text:PDF

GTID:2518306527470374

Subject:Computer technology

Abstract/Summary:


With the development of information technology, a large number of documents have emerged on the Internet. Mining valuable topic information from these documents is an important research problem in text mining, and text clustering is the most commonly used technique for mining hot-topic information from big data. Traditional text clustering is mostly oriented to documents from a single source. However, as information platforms multiply, cluster analysis of single-source datasets no longer meets current needs, and text clustering has begun to focus on multi-source datasets.

The structure of a multi-source dataset is complicated. According to how topic information differs across sources, this thesis summarizes these structures as heterogeneous heterogeneity, heterogeneous differences, and heterogeneous similarities. Traditional clustering methods cannot be applied directly to multi-source datasets, because these complex structures make it difficult for users to set an accurate number of topics before clustering and also cause topic confusion during clustering. How to solve the problems caused by this complex structure is therefore one research difficulty. In addition, multi-source datasets contain a large number of short texts, so how to handle feature sparsity is a second research difficulty.

To address the problems caused by the complex text structure, this thesis proposes a Hierarchical Dirichlet Multinomial Allocation topic clustering model (HDMA) based on Dirichlet Multinomial Allocation (DMA). On the one hand, the HDMA model inherits the advantage of DMA: it reduces the dependence on a number of topics specified in advance and can automatically estimate the number of topics in each source during clustering. On the other hand, the HDMA model provides an independent parameter space for the topic information of each source, which prevents topic information from different sources from being mixed. Experiments showed that the HDMA model clusters multi-source datasets well.

To address the problem of sparse text features, this thesis proposes a Hierarchical Dirichlet Multinomial Allocation topic clustering model with Latent Features (LFHDMA) based on the HDMA model. The model adds two semantic latent feature matrices and, with the help of word vectors trained on a large-scale corpus, provides additional semantic information for each dataset, reducing the impact of feature sparsity on clustering. Experiments showed that the LFHDMA model effectively improves clustering on multi-source text datasets.
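The way a DMA-style model can settle on a number of topics by itself can be illustrated with a minimal sketch. The code below is not the thesis's HDMA or LFHDMA model: it is a plain single-source Dirichlet-multinomial mixture with a symmetric Dirichlet(alpha / K) prior on the cluster weights, fitted by collapsed Gibbs sampling. The function name `dma_gibbs`, the hyperparameter defaults, and the toy documents are all assumptions made for illustration. K acts only as an upper bound; clusters that lose all their documents tend to stay empty, so the number of occupied clusters serves as the estimate.

```python
import math
import random
from collections import Counter


def dma_gibbs(docs, K=8, alpha=1.0, beta=0.1, iters=30, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet-multinomial mixture.

    docs  : list of tokenized documents (lists of strings)
    K     : upper bound on the number of clusters (hypothetical default)
    alpha : concentration of the Dirichlet(alpha / K) prior on weights
    beta  : symmetric Dirichlet prior on each cluster's word distribution
    Returns one cluster label per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    z = [rng.randrange(K) for _ in docs]      # random initial labels
    m = [0] * K                               # documents per cluster
    n = [0] * K                               # tokens per cluster
    nw = [Counter() for _ in range(K)]        # word counts per cluster
    for doc, k in zip(docs, z):
        m[k] += 1
        n[k] += len(doc)
        nw[k].update(doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            # Remove document d from its current cluster.
            k_old = z[d]
            m[k_old] -= 1
            n[k_old] -= len(doc)
            nw[k_old].subtract(doc)

            # Log of the conditional probability of each cluster for d.
            counts = Counter(doc)
            logp = []
            for k in range(K):
                lp = math.log(m[k] + alpha / K)
                for w, c in counts.items():
                    for j in range(c):
                        lp += math.log(nw[k][w] + beta + j)
                for i in range(len(doc)):
                    lp -= math.log(n[k] + vocab_size * beta + i)
                logp.append(lp)

            # Sample a new cluster in a numerically stable way.
            mx = max(logp)
            weights = [math.exp(lp - mx) for lp in logp]
            k_new = rng.choices(range(K), weights=weights)[0]

            # Add document d to the sampled cluster.
            z[d] = k_new
            m[k_new] += 1
            n[k_new] += len(doc)
            nw[k_new].update(doc)
    return z


if __name__ == "__main__":
    # Toy, made-up documents purely for demonstration.
    toy_docs = [
        ["gibbs", "sampling", "topic"], ["topic", "model", "gibbs"],
        ["football", "match", "goal"], ["goal", "match", "league"],
    ]
    labels = dma_gibbs(toy_docs)
    print("labels:", labels, "occupied clusters:", len(set(labels)))
```

Running this sampler separately on each source would give every source its own counts and labels, which mirrors the idea of an independent parameter space per source only loosely; the hierarchical coupling between sources that HDMA adds, and the latent feature matrices of LFHDMA, are not reproduced here.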

Keywords/Search Tags:

Text clustering, Multi-source documents, Topic model, Gibbs sampling


Related items

1	Research On The Topic Clustering Of Network Short Text
2	Research On Short Text Topic Discovery Based On BTM Topic Model
3	Research On Fast Gibbs Sampling Topic Inference Algorithms For Topic Models
4	Document Clustering Method Based On LDA Topic Model
5	Design And Implementation Of Clustering Software For Expert Research Interests
6	Research And Implementation Of Multi-source Text Topic Detection Based On Fusion Clustering
7	The Research And Implementation Of Topic Evolution Based On LDA
8	Research On Topic Models Combining Internal Feature And External Information Of Texts
9	Research On Learning Methods Based On Topic Model And Its Application In User Portraits
10	Research On The Distribution Characteristics Of Flora Information Based On The Probabilistic Topic Model