Research On Text Categorization Based On LDA And SVM

Posted on:2013-05-23

Degree:Master

Type:Thesis

Country:China

Candidate:J Xie

GTID:2248330362464304

Subject:Communication and Information System

Abstract/Summary:

Automatic text classification is a research focus in information retrieval and data mining, and one of the key technologies of machine learning and natural language processing. It has received extensive attention and developed rapidly in recent years. Machine learning methods applied to automatic text categorization have outperformed traditional text categorization models and have become classic examples in the related research and application fields.

Feature selection and the classification algorithm are the key issues in text categorization. The high dimensionality of the feature space causes the “curse of dimensionality”. On large-scale multi-class text data, traditional feature selection methods reduce the feature dimension poorly and commonly ignore the semantic relations between words. Real text data contains many categories, many samples, and noise, and the numbers of features across categories are imbalanced, so traditional classification algorithms cannot balance classification accuracy and speed.

This thesis studies text classification and its related technologies, and proposes solutions and improved methods aimed at improving classification performance and reducing text dimensionality. The research work mainly covers the following aspects:

(1) Term frequency and document frequency filters are added in the preprocessing stage of the text data, and category information is introduced into the traditional LDA feature selection algorithm to discover the differences within the underlying topics. This double feature selection method chooses the feature words most significant for classification.

(2) According to the characteristics of the text data, the LDA model is used to build a topic model separately for each category of training data. The parameters are estimated indirectly by Gibbs sampling, each document is represented as a probability distribution over a fixed set of latent topics, and the latent topic-document matrix is obtained. This simplifies the text data, reduces dimensionality significantly, and shortens the training time of the classification algorithm.

(3) Building on the work above, the SVM classification algorithm is applied, combining the strong feature representation of the LDA model with the powerful classification ability of the SVM algorithm. Compared with other feature selection methods and classification algorithms, experiments on Chinese and English corpora verify the effectiveness and superiority of the approach. The reduction in feature dimension is obvious, and F1, Macro-F1, Micro-F1, and accuracy are all improved.
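As an illustration of step (2), the following is a minimal sketch of collapsed Gibbs sampling for LDA, written with only the Python standard library. It is not the thesis's implementation: the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, the iteration count, and the toy corpus are all assumptions made for this example, and it fits one model on the whole corpus rather than one model per category. Each row of the returned matrix is a document's probability distribution over the fixed set of latent topics.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the document-topic matrix.

    docs is a list of documents, each a list of integer word ids.
    alpha, beta, and iters are illustrative defaults, not values from the thesis.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens of word w in topic k
    topic_total = [0] * num_topics                              # all tokens in topic k
    assign = []                                                 # topic of every token

    # Give every token a random initial topic and fill the count tables.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of each document's topic distribution.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return theta

# Toy corpus (made-up data): word ids 0-2 and 3-5 form two loose themes.
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
theta = lda_gibbs(docs, num_topics=2, vocab_size=6)
for row in theta:
    print([round(p, 3) for p in row])
```

Each document is reduced from a vocabulary-sized vector to `num_topics` values, which is the dimension reduction the abstract refers to. For step (3), rows like these could be passed as feature vectors to an SVM implementation such as scikit-learn's `LinearSVC`; that pairing is an assumption of this sketch, since the abstract does not name the software used.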

Keywords/Search Tags:

text categorization, feature selection, LDA model, multi-class categorization

Related items

1	Multi-class Scientific Literature Automatic Categorization System
2	A Study On Text Categorization Based On Machine Learning
3	The Study Of Chinese Text Categorization Based On Naïve Bayes
4	Research On Chinese Text Categorization Algorithms Based On Technology Text
5	The Research Of Text Representation And Feature Selection In Text Categorization
6	The Research And Implementation Of Chinese Text Categorization System
7	Research And Implementation Of Chinese Text Categorization Methods Based On Tree-like Keywords Set
8	Text Categorization Algorithm Based On Machine Learning
9	Design And Realization Of Text Categorization System
10	χ² Statistics-based Chinese Text Categorization Feature Selection Method