A Research On Multiple Text Representation For Text Classification

Posted on: 2019-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: N Q Li
GTID: 2428330545485304
Subject: Computer technology
Abstract/Summary:
Text classification is a fundamental task in natural language processing, and text representation is central to it: classification performance depends largely on the quality of the representation. Text is composed of words or characters that form phrases, sentences, paragraphs, sections, chapters, and articles. Machine learning algorithms, including neural networks and other deep learning models, require real-valued vectors or matrices as input, so they cannot process raw documents directly. Text representation transforms raw documents into vectors or matrices that a computer can operate on. Its core goal is to faithfully reflect the content of a text while keeping different texts distinguishable.

Text data contains multiple contextual features. A traditional text representation algorithm generates only one representation of the data to capture all of them, which weakens the representation of each individual feature. This thesis introduces a new approach, called multiple text representation, which generates a feature representation for each contextual structure in the text data in order to strengthen the overall representation of the text. Three ways of generating multiple text representations are presented:

1. Alter k-Means model. The Alter k-Means model generates multiple clusterings of the text data. Each clustering has a set of representative vectors, which project the original data into a new feature space. Multiple clusterings therefore yield multiple sets of representative vectors and multiple feature spaces. Projecting the data into multiple feature spaces enhances feature extraction.
2. Alter LDA model. The Alter LDA model finds multiple topic structures in the text data and generates a text representation for each of them. It uses the "Topic-Word" distribution as the topic structure of the text and the "Document-Topic" distribution as the features of each document.
3. Horizontal multiple text representation. This approach uses different text representation algorithms to generate multiple text representations. Each algorithm captures a different contextual feature of the text.

Experiments show that multiple text representation improves text classification performance while reducing the dimensionality of the features.
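The idea of projecting documents into several feature spaces can be illustrated with a minimal sketch. It is not the Alter k-Means model from the thesis, whose details are not given in this abstract. As a stand-in, it runs ordinary k-means several times with different random seeds, treats each set of centroids as a set of representative vectors, and concatenates the distances to those centroids. Different seeds do not guarantee genuinely different clusterings. The function names, the distance-based projection, and the toy count vectors are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's k-means; returns k centroids (the representative vectors)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def project(point, centroids):
    """Map a point to its distances from each centroid (one feature space)."""
    return [math.dist(point, c) for c in centroids]


def multiple_representation(points, k, seeds):
    """Concatenate the projections obtained from several independent clusterings."""
    spaces = [kmeans(points, k, seed) for seed in seeds]
    return [[f for centroids in spaces for f in project(p, centroids)] for p in points]


# Toy bag-of-words count vectors (hypothetical data, 4 vocabulary terms).
docs = [
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 3, 0, 2],
    [0, 2, 1, 3],
    [1, 0, 3, 0],
    [0, 1, 2, 1],
]
features = multiple_representation(docs, k=2, seeds=[0, 1, 2])
print(len(features), len(features[0]))  # 6 documents, 3 clusterings * 2 centroids each
```

Each document ends up with one block of features per clustering, and the concatenated vector can be passed to any standard classifier. In this sketch the final dimension is the number of clusterings times k, which can be much smaller than the vocabulary size of a realistic corpus.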
Keywords: Text Representation, Multiple Clustering, Multiple Feature, Text Classification