
Research On Text Categorization Based On LDA And SVM

Posted on: 2013-05-23 | Degree: Master | Type: Thesis
Country: China | Candidate: J Xie | Full Text: PDF
GTID: 2248330362464304 | Subject: Communication and Information System
Abstract/Summary:
Automatic text classification is a research focus in information retrieval and data mining. It has received extensive attention and developed rapidly in recent years, and it is one of the key technologies of machine learning and natural language processing. Machine learning methods have recently been applied to automatic text categorization, where they have outperformed traditional text categorization models and have become classic examples in the related research and application fields.

Feature selection and the classification algorithm are the key issues in text categorization. The high dimensionality of the feature space causes the “curse of dimensionality”. On large-scale, multi-class text data, traditional feature selection methods reduce the feature dimension poorly and commonly ignore the semantic relations between words. Real text data has many categories and many samples, contains noise, and is imbalanced in the number of features across categories, so traditional classification algorithms cannot balance classification accuracy and speed.

This thesis studies text classification and related techniques, and proposes corresponding solutions and improved methods aimed at improving classification performance and reducing text dimensionality.
The main work of this thesis is as follows:

(1) Term frequency and document frequency filters are added in the preprocessing stage of the text data, and category information is introduced into the traditional LDA feature selection algorithm to discover the internal differences among the underlying topics. This double feature selection chooses the feature words that are most significant for classification.

(2) According to the characteristics of the text data, the LDA model is used to build a topic model separately for each class of training data. The parameters are estimated indirectly by Gibbs sampling, and each document is represented as a probability distribution over a fixed set of latent topics, which yields the latent topic-document matrix. The text data is simplified, the dimension reduction is significant, and the training time of the classification algorithm is reduced.

(3) Building on the work above, the SVM classification algorithm is applied, combining the good feature representation of the LDA model with the strong classification ability of the SVM algorithm. Compared with other feature selection methods and classification algorithms, experiments on Chinese and English corpora verify the effectiveness and superiority of the approach. The reduction in feature dimension is evident, and the F1, Macro-F1, Micro-F1 and accuracy values are improved.
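
The document-frequency filter from point (1) and the Gibbs-sampled document-topic representation from point (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `df_filter` and `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are all assumptions made for illustration. The sketch fits a single topic model rather than one per class, and it omits the category-aware feature selection and the SVM step.

```python
import random

def df_filter(docs, min_df=2):
    """Drop words that occur in fewer than min_df documents."""
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return [[w for w in doc if df[w] >= min_df] for doc in docs]

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the document-topic matrix."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    n_dk = [[0] * n_topics for _ in docs]      # topic counts per document
    n_kw = [[0] * V for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                       # total tokens per topic
    z = []                                     # topic assignment of each token
    # Random initialisation of topic assignments (illustrative choice).
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Unnormalised full conditional p(z = t | other assignments).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1
    # theta[d][k]: smoothed share of document d's tokens assigned to topic k.
    return [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]

# Hypothetical toy corpus: two finance-like and two sports-like documents.
docs = [
    "stock market price stock trade".split(),
    "market trade price bank".split(),
    "match team goal team player".split(),
    "player goal match coach".split(),
]
theta = lda_gibbs(df_filter(docs), n_topics=2)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` is a probability distribution over the fixed topic set, so a document is described by `n_topics` numbers instead of one weight per vocabulary word. In a pipeline like the one the abstract describes, these rows would serve as the feature vectors passed to the SVM classifier.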
Keywords/Search Tags: text categorization, feature selection, LDA model, multi-class categorization