
Research And Application Of Text Classification Model Based On Topic Model

Posted on: 2015-09-17
Degree: Master
Type: Thesis
Country: China
Candidate: X F Mu
GTID: 2308330473452031
Subject: Information security
Abstract/Summary:
Text classification has a long history. It was applied to spam filtering as early as the period when people first began to exchange e-mail over the Internet, and people have since grown accustomed to relying on it to separate spam from legitimate e-mail. As the Internet has developed, text has moved from news and e-mail to blogs and forums, and now to microblogs. These different forms of text bring different application scenarios: for example, the language shifts from a written to a spoken style, and documents become shorter. Across these scenarios, text classification is commonly used as the first step of text mining. Whether text can be classified in a simple way, and whether new techniques can be applied to the task, are questions that keep text classification worthy of continued study.

In recent years, topic models have been introduced in the field of text mining. Instead of representing a document by its words, topic model theory represents it by topics; that is, each document is regarded as being composed of a fixed number of topics. High dimensionality has always been a problem in text classification. This thesis applies Latent Dirichlet Allocation (LDA), a topic model algorithm, to reduce the dimensionality of text. The main innovations and achievements are as follows.

A text classification model based on a topic model is proposed. The topic model is used to extract the topics of the text data set, and each document is represented as a vector in topic space. A support vector machine (SVM) is then trained on these vectors to produce a classification model, which is used to predict the category of new documents.
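The topic-vector representation described above can be illustrated with a small sketch. The abstract does not say how LDA inference was carried out, so the collapsed Gibbs sampler below is an assumption, as are the function name `lda_topic_vectors`, the hyperparameters `alpha` and `beta`, and the toy documents. It only shows how each document is turned into a short vector of topic proportions; it is not the thesis implementation, and the classification step is left out.

```python
import random

def lda_topic_vectors(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, not the thesis code).

    docs is a list of token lists. Returns (vectors, vocab), where vectors[d]
    holds the estimated topic proportions of document d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word v is in topic k
    topic_total = [0] * num_topics                              # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token from the counts, then resample its topic.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed topic proportions: one num_topics-dimensional vector per document.
    vectors = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        vectors.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return vectors, vocab

# Hypothetical toy corpus, already tokenized.
docs = [
    "cheap pills buy now cheap offer".split(),
    "buy cheap offer now limited pills".split(),
    "meeting agenda project report schedule".split(),
    "project schedule report meeting notes".split(),
]
vectors, vocab = lda_topic_vectors(docs, num_topics=2)
print(len(vocab), "term dimensions reduced to", len(vectors[0]), "topic dimensions")
```

Each entry of `vectors` has `num_topics` components regardless of vocabulary size, and these vectors are what would be passed to the classifier in place of term counts.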
Because the number of topics is much smaller than the number of terms, representing documents by topics helps to address the high-dimensionality problem. Compared with a word-based representation, topics also summarize a text better at the semantic level. The experimental results show that the algorithm reduces the dimensionality of the text and improves the classification results, which illustrates the feasibility and effectiveness of the proposed algorithm for text classification.

To address the multi-class problem in text classification, the thesis also studies multi-class classification with support vector machines.

Finally, the thesis implements a visual Chinese text classification system. The system provides a graphical interface that makes it easier for users to carry out data preprocessing, classifier training, prediction, evaluation, result visualization, and report generation.
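The abstract does not state which multi-class strategy was used with the support vector machine. One common option is one-vs-rest, sketched below under that assumption: one binary linear SVM is trained per category, and a document is assigned to the category whose classifier gives the highest score. The helper names, the subgradient-descent training, the hyperparameters, the category labels, and the hand-made topic vectors are all illustrative and are not taken from the thesis.

```python
def train_linear_svm(X, y, epochs=500, lr=0.1, lam=0.01):
    """Binary linear SVM with labels +1/-1.

    Minimizes (lam / 2) * ||w||^2 + mean hinge loss by batch subgradient descent.
    """
    dim = len(X[0])
    n = len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                # point violates the margin
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def train_one_vs_rest(X, labels):
    """Train one binary SVM per category: that category against all the others."""
    models = {}
    for c in sorted(set(labels)):
        y = [1 if label == c else -1 for label in labels]
        models[c] = train_linear_svm(X, y)
    return models

def predict(models, x):
    """Return the category whose classifier scores x highest."""
    def score(c):
        w, b = models[c]
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=score)

# Hand-made 3-topic document vectors standing in for LDA output (hypothetical).
X = [
    [0.90, 0.05, 0.05], [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05], [0.10, 0.80, 0.10],
    [0.05, 0.05, 0.90], [0.10, 0.10, 0.80],
]
labels = ["sports", "sports", "finance", "finance", "tech", "tech"]

models = train_one_vs_rest(X, labels)
print(predict(models, [0.7, 0.2, 0.1]))
```

One-vs-one voting or a dedicated multi-class SVM formulation would be equally plausible readings of the abstract; one-vs-rest is used here only because it is the shortest to show.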
Keywords/Search Tags: Topic model, Latent Dirichlet allocation, Support vector machine, Multi-class classification, Feature selection