Research On Multi-source Text Topic Mining Algorithm

Posted on: 2020-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Xu
Full Text: PDF
GTID: 2428330596473185
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, people obtain text information from many network channels every day, so processing text from multiple sources has become an important task. Most traditional topic mining models are designed for single-source text data and are difficult to apply effectively to multi-source data, whose form is more complex. Text from different sources shares certain similarities in the distribution of topic information, but the vocabulary used to express a given topic differs noticeably between sources. Traditional models cannot make good use of the topic knowledge shared across sources, and they have difficulty reconciling the different representations of the same topic in different sources. To better understand multi-source text, we propose MSDMA, a novel topic model for multi-source text data based on the Dirichlet Multinomial Allocation model. It has three main advantages: 1) it learns topic information from several sources simultaneously, discovering the latent relationships between sources in their topic knowledge while retaining the differences in how each source expresses a topic in its vocabulary; 2) through transfer learning, fusing data sources of different quality improves topic discovery for low-quality sources with high noise and little information; 3) it learns the number of topics in each source automatically, which adapts better to multi-source data than the traditional approach of setting the number manually. Based on MSDMA, the ?-MSDMA model is designed. Its modeling process has two main parts. First, the MSDMA model is trained on part of the data set. After training, the prior parameters of the topic-word distribution are updated. The updated prior parameters are then applied to the new data set, enabling the model to describe the observed data more accurately and to discover topics faster and more effectively. Finally, large-scale experiments on simulated and real data sets show that our method mines the topics of multi-source text more effectively than traditional mainstream methods.
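The two-stage idea in the abstract (train on part of the data, then carry the learned topic-word information forward as a prior) can be illustrated with a minimal sketch. This is not the thesis's update rule, which the abstract does not give; it assumes a simple scheme in which stage-one topic-word counts are normalized and scaled into an asymmetric Dirichlet prior. The function names and the `base_beta` and `strength` parameters are hypothetical.

```python
def updated_prior(topic_word_counts, base_beta=0.01, strength=10.0):
    """Turn stage-one topic-word counts into per-topic Dirichlet priors.

    Assumed scheme (not from the thesis): each prior entry is a small
    symmetric base value plus the word's stage-one share of the topic,
    scaled by `strength`.
    """
    priors = []
    for counts in topic_word_counts:
        total = sum(counts)
        if total == 0:
            # A topic that saw no words keeps the symmetric base prior.
            priors.append([base_beta] * len(counts))
            continue
        priors.append([base_beta + strength * c / total for c in counts])
    return priors


def topic_word_probs(counts, prior):
    """Posterior mean of a multinomial under a Dirichlet prior."""
    denom = sum(counts) + sum(prior)
    return [(c + b) / denom for c, b in zip(counts, prior)]


# Stage one: hypothetical counts for 2 topics over a 3-word vocabulary.
stage_one_counts = [[8, 2, 0], [0, 1, 9]]
priors = updated_prior(stage_one_counts)

# Stage two: a new data set has contributed one token to topic 0 so far.
probs = topic_word_probs([1, 0, 0], priors[0])
print(probs)
```

With an informative prior, the stage-two estimate for topic 0 already favors the words that dominated it in stage one, even though the new data has contributed only a single token.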
Keywords/Search Tags: multi-source text, Dirichlet Multinomial Allocation model, topic model, text mining