A Text Classification Algorithm Based On Statistical Manifold Learning

Posted on: 2018-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Li
Full Text: PDF
GTID: 2348330512486742
Subject: Computer software and theory
Abstract/Summary:
Text is a commonly used data type that is prevalent on the Internet, and the Internet generates large amounts of text data. Text classification methods play an important role in information retrieval, data mining, sentiment analysis, and related tasks. According to how features are extracted, text classification methods can be divided into three categories: statistics-based methods, semantic similarity-based methods, and deep learning-based methods. Statistics-based text classification algorithms include the term frequency-inverse document frequency (TF-IDF) model, naive Bayes, and others. These methods use words as feature items and the number of occurrences of words as weights. Each text is represented as a feature vector, and a classifier is then applied to the feature vectors. These methods assume that similar texts share many words in common, but this assumption ignores the fact that different words may be semantically similar. Text classification based on semantic similarity usually measures the similarity of texts according to the topic distribution over the text, but these methods cannot clearly capture the diversity of the topic distributions of words and texts. Recently, deep learning has attracted great attention from many researchers, but deep learning methods, e.g. convolutional neural networks and recurrent neural networks, also have limitations, for instance the vanishing gradient problem and the time required to train a large number of parameters.

This thesis presents a text classification algorithm based on statistical manifold learning, which provides a probabilistic model representation of text based on latent topics. Assuming that the words under the same topic follow a Gaussian distribution, the text is expressed as a Gaussian mixture model. Statistical manifold learning can then be used to calculate the distance between texts. The main work of this dissertation can be summarized as follows:

(1) Following the process of text generation, a text representation based on a probabilistic model is proposed. Each topic is represented as a Gaussian distribution, and each text is represented as a Gaussian mixture model. This probabilistic model can better describe the diversity of topics in texts and words.

(2) The distribution of topics over a text is described by a probabilistic model, and the time complexity of the text modeling algorithm is estimated to be O(n), where n is the number of words in the text. This alleviates the slow training and corpus dependence of topic models.

(3) Through statistical manifold learning, the distance between text probabilistic models is calculated, which offers a novel approach to defining a metric on probabilistic models.

(4) In the experiments, three different tasks verify the validity of the proposed algorithm and the capability of the Gaussian mixture model to describe the word distribution under mixed topics.
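
To make the idea of a distance between texts modeled as Gaussian mixtures more concrete, here is a minimal sketch in Python. It is not the thesis's method: the abstract does not state which divergence or metric is used, so the sketch assumes diagonal-covariance topic Gaussians that are already fitted and uses a symmetrized, matching-based approximation of the Kullback-Leibler divergence between mixtures. The names `kl_diag_gauss`, `kl_gmm_matched`, and `text_distance`, and the toy mixtures, are hypothetical.

```python
import math

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for two Gaussians with diagonal covariance."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def kl_gmm_matched(p, q):
    """Matching-based approximation of KL(p || q) between two mixtures.

    Each mixture is a list of (weight, mean, variance) components.
    Every component of p is matched to the component of q that
    minimizes KL(p_i || q_j) - log(weight_j). This is an assumption
    of the sketch, not a formula taken from the thesis.
    """
    total = 0.0
    for w_p, mu_p, var_p in p:
        best = min(
            kl_diag_gauss(mu_p, var_p, mu_q, var_q) - math.log(w_q)
            for w_q, mu_q, var_q in q
        )
        total += w_p * (best + math.log(w_p))
    return total

def text_distance(p, q):
    """Symmetrized approximate divergence between two text models."""
    return 0.5 * (kl_gmm_matched(p, q) + kl_gmm_matched(q, p))

# Toy example: two "topics" in a 2-D embedding space (made-up numbers).
# The two texts share the same topic Gaussians but mix them differently.
doc_a = [(0.7, [0.0, 0.0], [1.0, 1.0]), (0.3, [5.0, 5.0], [1.0, 1.0])]
doc_b = [(0.2, [0.0, 0.0], [1.0, 1.0]), (0.8, [5.0, 5.0], [1.0, 1.0])]

print(text_distance(doc_a, doc_b))
```

The resulting value is a dissimilarity score rather than a true metric, since it need not satisfy the triangle inequality. It could feed a nearest-neighbour classifier over texts, which is one plausible way to use a distance between probabilistic models for classification.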
Keywords/Search Tags: Text classification, Manifold learning, Mixture model, Topic model