
Application Research on Probabilistic Topic Models in Text Classification

Posted on: 2010-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y G Lin
Full Text: PDF
GTID: 2178360302459721
Subject: Computer application technology
Abstract/Summary:
Data skew and noisy samples are frequently encountered in text classification applications. In skewed data, the samples cannot correctly reflect the real data distribution, and a classifier may ignore a rare class that is overwhelmed by majority-class samples. Most classification methods are designed for balanced data, so they cannot achieve good performance on skewed data. On the other hand, the performance of a classifier is largely determined by the quality of its training corpus. In real applications, however, especially at large scale, the quality of the training corpus is unreliable and noisy samples are unavoidable, which hinders classifier performance. For such situations, a text classification method and a noise-processing method based on the LDA (Latent Dirichlet Allocation) probabilistic topic model are proposed. By using global semantic information to generate texts artificially, classification methods can achieve better performance.

The contributions of this work are as follows.

First, an LDA-based text generation method is proposed. LDA models are estimated from the training corpus by Gibbs sampling, and texts are then generated by following the generative process of LDA. Experimental results show that the generated documents are similar to the original documents without incurring over-fitting.

Second, for skewed data, DECOM, a novel text classification approach based on the LDA model, is proposed. This approach increases the number of instances of rare classes in the training corpus while avoiding the over-fitting encountered in traditional methods. Experimental results on real data sets show that DECOM is more suitable than other methods for text classification on skewed data.

Finally, a noise-processing method for classification using LDA is proposed. In this method, noisy samples are filtered according to class entropy, and the data is then smoothed using the generative process of the topic model to further weaken the influence of the noisy samples.
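The class-entropy filtering step could be sketched as follows. The per-document class posterior, the entropy threshold, and all names here are illustrative assumptions for a minimal sketch, not the thesis's exact procedure; the thesis would derive such probabilities from the LDA topic representation.

```python
import math

def class_entropy(class_probs):
    """Shannon entropy (in bits) of a class posterior distribution.

    A low entropy means the sample is confidently assigned to one class;
    a high entropy suggests an ambiguous, possibly noisy sample.
    """
    return -sum(p * math.log(p, 2) for p in class_probs if p > 0)

def filter_noisy(samples, threshold):
    """Keep samples whose class posterior is confident (low entropy).

    `samples` is a list of (doc, class_probs) pairs; `class_probs` is a
    hypothetical per-document class posterior, e.g. estimated from the
    document's LDA topic mixture. The thresholding rule is illustrative.
    """
    return [(doc, probs) for doc, probs in samples
            if class_entropy(probs) <= threshold]
```

Filtered-out samples would then be replaced by documents generated from the topic model, which is how the original corpus size could be kept unchanged.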
Meanwhile, the original size of the training corpus is kept unchanged. Experimental results on real-world data show that the method is robust to the distribution of noise and performs relatively well on data sets with a high noise ratio.

Detailed experiments and theoretical analysis show that the semantic information in the training corpus can be extracted and exploited by a probabilistic topic model, so that better text classification performance can be achieved in intricate practical situations.
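The LDA generative process that the text generation, rare-class augmentation, and smoothing steps all rely on can be sketched as below. The function name and inputs are assumptions for illustration; the thesis estimates the topic mixture and per-topic word distributions by Gibbs sampling on the training corpus.

```python
import random

def generate_document(theta, phi, vocab, length, rng=random):
    """Sample a document from LDA's generative process.

    theta: topic mixture for the document (list of K probabilities)
    phi:   per-topic word distributions (K lists over the vocabulary)
    vocab: list of word strings
    All inputs are assumed to have been estimated beforehand, e.g. by
    Gibbs sampling, as the abstract describes.
    """
    words = []
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]   # draw a topic
        w = rng.choices(range(len(vocab)), weights=phi[z])[0]  # draw a word
        words.append(vocab[w])
    return words
```

Drawing many such documents for a rare class would increase its share of the training corpus, which is the intuition behind the DECOM-style augmentation described above.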
Keywords/Search Tags: text classification, probabilistic topic model, data skew, class imbalance, noise, class entropy