
Application Research on Probabilistic Topic Models in Text Classification

Posted on: 2010-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y G Lin
Full Text: PDF
GTID: 2178360302459721
Subject: Computer application technology
Abstract/Summary:
Data skew and noisy samples are frequently encountered in text classification applications. In skewed data, the samples cannot correctly reflect the real data distribution, and a classifier may ignore a rare class that is overwhelmed by majority-class samples. Most classification methods are designed for balanced data, so they cannot achieve good performance on skewed data. On the other hand, the performance of a classifier is largely determined by the quality of its training corpus. In real applications, however, especially at large scale, the quality of the training corpus is unreliable and noisy samples are unavoidable, which hinders classifier performance. For such situations, a text classification method and a noise-processing method based on the LDA (Latent Dirichlet Allocation) probabilistic topic model are proposed. By using global semantic information to generate texts artificially, classification methods can achieve better performance.

The contributions of this work are as follows.

First, an LDA-based text generation method is proposed. LDA models are estimated from the training corpus by Gibbs sampling, and texts are then generated by following the generative process of LDA. Experimental results show that the generated documents are similar to the original documents without incurring over-fitting.

Second, for skewed data, DECOM, a novel text classification approach based on the LDA model, is proposed. This approach increases the number of instances of rare classes in the training corpus while avoiding the over-fitting encountered in traditional methods. Experimental results on real data sets show that DECOM is more suitable than other methods for text classification on skewed data.

Finally, a noise-processing method for classification using LDA is proposed. In this method, noisy samples are filtered according to class entropy, and the data is then smoothed using the generative process of the topic model to further weaken the influence of the noisy samples.
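The class-entropy filtering step could be sketched as follows. The per-document class posterior, the entropy threshold, and all names here are illustrative assumptions for a minimal sketch, not the thesis's exact procedure; the thesis would derive such probabilities from the LDA topic representation.

```python
import math

def class_entropy(class_probs):
    """Shannon entropy (in bits) of a class posterior distribution.

    A low entropy means the sample is confidently assigned to one class;
    a high entropy suggests an ambiguous, possibly noisy sample.
    """
    return -sum(p * math.log(p, 2) for p in class_probs if p > 0)

def filter_noisy(samples, threshold):
    """Keep samples whose class posterior is confident (low entropy).

    `samples` is a list of (doc, class_probs) pairs; `class_probs` is a
    hypothetical per-document class posterior, e.g. estimated from the
    document's LDA topic mixture. The thresholding rule is illustrative.
    """
    return [(doc, probs) for doc, probs in samples
            if class_entropy(probs) <= threshold]
```

Filtered-out samples would then be replaced by documents generated from the topic model, which is how the original corpus size could be kept unchanged.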
Meanwhile, the original size of the training corpus is kept unchanged. Experimental results on real-world data show that the method is robust to the distribution of noise and performs relatively well on data sets with a high noise ratio.

Detailed experiments and theoretical analysis show that the semantic information in the training corpus can be extracted and exploited by a probabilistic topic model, so that better text classification performance can be achieved in intricate practical situations.
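The LDA generative process that the text generation, rare-class augmentation, and smoothing steps all rely on can be sketched as below. The function name and inputs are assumptions for illustration; the thesis estimates the topic mixture and per-topic word distributions by Gibbs sampling on the training corpus.

```python
import random

def generate_document(theta, phi, vocab, length, rng=random):
    """Sample a document from LDA's generative process.

    theta: topic mixture for the document (list of K probabilities)
    phi:   per-topic word distributions (K lists over the vocabulary)
    vocab: list of word strings
    All inputs are assumed to have been estimated beforehand, e.g. by
    Gibbs sampling, as the abstract describes.
    """
    words = []
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]   # draw a topic
        w = rng.choices(range(len(vocab)), weights=phi[z])[0]  # draw a word
        words.append(vocab[w])
    return words
```

Drawing many such documents for a rare class would increase its share of the training corpus, which is the intuition behind the DECOM-style augmentation described above.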
Keywords/Search Tags: text classification, probabilistic topic model, data skew, class imbalance, noise, class entropy