Research And Application Of The Methods For Optimization And Filtration Of Topics

Posted on: 2016-08-17; Degree: Master; Type: Thesis
Country: China; Candidate: Y Xie; Full Text: PDF
GTID: 2308330470478587; Subject: Software engineering
Abstract/Summary:
As topic models have attracted growing attention, scholars in China and abroad have applied them to a wide range of fields. As a latent representation, the topics of a data set capture rich semantic features in a low-dimensional space. Although more and more improved topic models have been developed, the quality of the topics they produce is often unsatisfactory, and how to improve topic quality has attracted considerable attention from researchers. This thesis proposes a series of filtering and optimization methods to improve topic quality. First, part-of-speech analysis is introduced during corpus preprocessing so that only useful nouns and verbs are kept in each document. Second, a topic semantic consistency measure is proposed to evaluate topic quality, and a threshold on this measure is used to filter out noisy topics. Third, for the words within each topic, topical words analysis is used to remove noisy words. The resulting higher-quality topics are then used to infer the latent representation of new documents for text classification. Compared with the original LDA, the proposed filtering and optimization methods effectively improve text classification results. The work of this thesis is organized as follows:

Research and application of a topic filtering method. Achieving better topic performance requires a precise method for evaluating topic quality. Traditional methods evaluate topics by topic coherence, which is computed from the current corpus and cannot be adjusted using an external corpus. This thesis proposes topic semantic consistency to evaluate topic quality: word vectors are generated from an external corpus (Wikipedia 2014), and the distance between two words is computed in the vector space. Topic consistency is combined with the word frequency matrix of the documents, so that, guided by the external corpus, the quality of the topics can be evaluated accurately.
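The idea of scoring a topic by the vector-space closeness of its words can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the abstract: the score is taken to be the mean pairwise cosine similarity of a topic's top words, the combination with the word frequency matrix is omitted, and `TOY_VECTORS` and `TOY_TOPICS` are made-up stand-ins for vectors trained on an external corpus and for topics produced by LDA. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_consistency(words, vectors):
    """Mean pairwise cosine similarity of the topic words that have a vector.

    Assumption: this averaging is an illustrative choice, not the thesis's formula.
    """
    known = [w for w in words if w in vectors]
    pairs = list(combinations(known, 2))
    if not pairs:
        return 0.0
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def filter_topics(topics, vectors, threshold):
    """Keep only the topics whose consistency score reaches the threshold."""
    return {
        name: words
        for name, words in topics.items()
        if topic_consistency(words, vectors) >= threshold
    }


# Made-up 2-dimensional vectors standing in for externally trained word vectors.
TOY_VECTORS = {
    "goal": (0.9, 0.1),
    "match": (0.8, 0.2),
    "team": (0.85, 0.15),
    "bank": (0.1, 0.9),
}

# Made-up topics: one coherent, one polluted by an unrelated word.
TOY_TOPICS = {
    "sports": ["goal", "match", "team"],
    "noisy": ["goal", "bank", "match"],
}

if __name__ == "__main__":
    for name, words in TOY_TOPICS.items():
        print(name, round(topic_consistency(words, TOY_VECTORS), 3))
    print(list(filter_topics(TOY_TOPICS, TOY_VECTORS, threshold=0.8)))
```

With these toy values and a threshold of 0.8, the "sports" topic is kept and the "noisy" topic is dropped. In practice the threshold would have to be tuned on the corpus at hand.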
A threshold is then set to filter out noisy topics, and the resulting performance is verified on text classification tasks.

Research and application of a topic optimization method. In this thesis, topical words are words that are used more often in the current corpus than in general English; they help to summarize the central idea of a document. Topical words assessment is used to optimize topics and reduce the influence of noisy words. In addition, WordNet, a hierarchical semantic dictionary, is used to calculate the correlation between the words in a topic and to determine whether they are semantically related, which further optimizes the topics and improves text classification performance.
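The notion of a topical word, one used more often in the current corpus than in general English, can be sketched as a simple frequency ratio. The abstract does not say which statistic the thesis uses, so the smoothed ratio below is an assumption, as are the function names, the cutoff, and the toy data. The background frequencies in `TOY_BACKGROUND` are invented for illustration and are not real English statistics.

```python
from collections import Counter


def topical_scores(corpus_tokens, background_freq, smoothing=1e-6):
    """Ratio of each word's relative frequency in the corpus to its background frequency.

    Assumption: a plain smoothed ratio; the thesis may use a different statistic.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {
        word: (count / total) / (background_freq.get(word, 0.0) + smoothing)
        for word, count in counts.items()
    }


def prune_topic(topic_words, scores, min_ratio):
    """Drop topic words whose topicality ratio falls below min_ratio."""
    return [w for w in topic_words if scores.get(w, 0.0) >= min_ratio]


# Made-up tokens standing in for the current corpus.
TOY_TOKENS = ["topic", "model", "topic", "the", "the", "corpus", "model", "topic"]

# Invented relative frequencies standing in for general English usage.
TOY_BACKGROUND = {"the": 0.05, "topic": 0.0005, "model": 0.001, "corpus": 0.0001}

SCORES = topical_scores(TOY_TOKENS, TOY_BACKGROUND)

if __name__ == "__main__":
    print(prune_topic(["topic", "the", "model"], SCORES, min_ratio=10.0))
```

With these toy numbers and a cutoff of 10, the function word "the" is removed from the topic while "topic" and "model" are kept. The WordNet-based relatedness check is not sketched here, since it depends on an external lexical resource.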
Keywords/Search Tags: Topic Filter, Topic Optimization, Topic Semantic Coherence, Topical Words Analysis