Research On Text Representation Model Based On LDA And Latent Feature Vector

Posted on:2020-11-15

Degree:Master

Type:Thesis

Country:China

Candidate:H J Peng


GTID:2428330572473671

Subject:Information and Communication Engineering

Abstract/Summary:

As an effective means of handling unstructured information, text categorization has been studied extensively and applied widely in natural language processing. However, because text data is unstructured, high-dimensional and highly sparse, the representation of text is a key factor in the quality of subsequent text processing, and text categorization performance depends heavily on the text representation model. A commonly used approach represents a text by its topics, so the accuracy of the topic model directly affects the accuracy of the text representation. This thesis mainly studies text topic models and text representation models.

The LDA model predicts each word in a document from a global view, but it does not capture the contextual relationships among feature words, so the local semantic information of the document is lost. Existing improvements based on LDA and latent features fall into two broad categories. The first targets short texts and improves topic prediction by extending the word vector library using a large corpus. The second computes topic vectors directly by summing word vectors; in such methods, however, the word vectors and the topic-word distributions are trained by different models and do not lie in the same semantic space, so topic vectors obtained by directly summing word vectors are not accurate enough. Considering these defects of LDA and of the existing improved models, this thesis introduces latent feature vectors that carry the semantic features of the text into the model and proposes LFV-LDA, a text topic representation model based on LDA and latent feature vectors. Word vectors and topic vectors are trained in the same semantic vector space within the document-topic-word hierarchy, and the improved model outputs text topic vectors directly. Experiments that train and test LFV-LDA on a news corpus show that the topic representation based on LDA and latent feature vectors improves on both the traditional topic model representation and similar improved LDA models.

With topic vectors that perform well in text categorization, this thesis then proposes two text representation methods. The first is based on the probability distribution of the topic vectors: it represents a text by normalizing the statistically weighted topic vectors. The second is based on Doc2Vec and topic vectors: it integrates topic information into the text representation by measuring the distance between the document vector and the topic vectors trained by the model. Finally, the models are trained and tested on the news corpus. The results show that both models classify text better than the traditional models, and that the second text representation model performs better than the first.
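
The two text representation methods can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names are hypothetical, the first method is read as an L2-normalized sum of topic vectors weighted by the document-topic probabilities, and the second is read as appending the cosine similarity between a document vector (such as one produced by Doc2Vec) and each topic vector. The thesis itself may weight, measure distance, or combine the vectors differently.

```python
import numpy as np


def topic_weighted_representation(theta, topic_vectors):
    """Assumed reading of method 1 (hypothetical helper).

    theta:         shape (K,), document-topic probabilities.
    topic_vectors: shape (K, D), one vector per topic.
    Returns the L2-normalized, probability-weighted sum of topic vectors.
    """
    weighted = theta @ topic_vectors          # shape (D,)
    norm = np.linalg.norm(weighted)
    return weighted / norm if norm > 0 else weighted


def distance_augmented_representation(doc_vector, topic_vectors):
    """Assumed reading of method 2 (hypothetical helper).

    doc_vector:    shape (D,), e.g. a Doc2Vec document vector.
    topic_vectors: shape (K, D), topic vectors in the same space.
    Returns the document vector with K cosine similarities appended.
    Assumes no input vector is all zeros.
    """
    doc_unit = doc_vector / np.linalg.norm(doc_vector)
    topic_units = topic_vectors / np.linalg.norm(
        topic_vectors, axis=1, keepdims=True
    )
    similarities = topic_units @ doc_unit     # shape (K,)
    return np.concatenate([doc_vector, similarities])


# Toy data, invented for illustration only (not from the thesis).
toy_topic_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_theta = np.array([0.5, 0.5])
toy_doc_vector = np.array([3.0, 4.0])

rep_method_1 = topic_weighted_representation(toy_theta, toy_topic_vectors)
rep_method_2 = distance_augmented_representation(toy_doc_vector, toy_topic_vectors)
```

In this reading, the first representation stays in the topic-vector space (dimension D), while the second has dimension D + K, so a downstream classifier sees both the document embedding and its affinity to each topic.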

Keywords/Search Tags:

LDA, feature vector, topic model, text representation, text categorization

Related items

1	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
2	The Research And Implementation Of Chinese Text Categorization
3	Research On Chinese Text Categorization Algorithms Based On Technology Text
4	A Study On Text Categorization Based On Machine Learning
5	Application Of CTM Model Optimization Feature Selection In Text Categorization
6	Modeling And Implementation Of Chinese Text Categorization System Based On SVM
7	Text Representation And Algorithms For Chinese Text Classification
8	Studies On Some Essential Problems In Automatic Text Categorization
9	The Research And Implementation Of Chinese Text Categorization System
10	Multi-class Scientific Literature Automatic Categorization System