The Research And Implementation Of Chinese Text Classification Based On Feature Selection And LDA

Posted on:2015-01-07

Degree:Master

Type:Thesis

Country:China

Candidate:L L Dong

Full Text:PDF

GTID:2268330428968665

Subject:Computer technology

Abstract/Summary:

Science and technology are advancing rapidly, informatization is accelerating, and the Internet is reaching more and more of the world. As a result, people have a growing number of channels for obtaining, spreading and sharing information. At the same time, they face a major challenge: the "information explosion". Finding the needed information efficiently, accurately and conveniently has therefore become a pressing concern, and text classification emerged in response. As an important form of data analysis, text classification is an effective way to organize and manage data, and it has been widely applied in search engines, digital libraries, e-government, spam filtering and other fields.

Text classification comprises preprocessing, feature selection, text representation, classifier selection, classifier training, classifier testing, evaluation of the classification results, and other steps. Put simply, its ultimate aim is to predict category labels for test texts. Every step in the pipeline can affect the final classification result. Preprocessing performs a preliminary dimensionality reduction that lowers redundancy and prepares the data for the classifier. Feature selection, the core of dimensionality reduction for text, removes noisy features. Text representation converts unstructured text into a structured form that a computer can process efficiently. The classifier predicts category labels: through training it learns a classification function that maps a text to a label, and the trained classifier can then be tested on new data.
Evaluation of the classification results assesses the entire classification system comprehensively and objectively.

This thesis focuses on feature selection and text representation. To address the shortcomings of traditional feature selection, it proposes several improvements and combines them with the LDA model to compensate for the weaknesses that appear when LDA is used alone, thereby further improving classification performance.

First, this thesis proposes three indicators, namely the relative term frequency, the dispersion and the maximum absolute value, to solve the problems caused by traditional mutual information feature selection ignoring term frequency. These three factors cover the shortcomings of the traditional method and so improve it.

Second, this thesis proposes an indicator called "max proportion of term frequency" to deal with the sharp drop in the classification performance of traditional information gain (IG) on imbalanced datasets. The result is an improved IG feature selection method that performs well whether or not the dataset is balanced.

Finally, this thesis proposes combining feature selection with LDA, a topic model, to address the problem that classification accuracy is not high enough when LDA is used alone. As a topic model, LDA can represent texts by their topic probabilities and, like feature selection methods, also reduces dimensionality. However, its accuracy is not high when it is used alone. This thesis therefore uses LDA as a text representation method and applies the feature selection methods beforehand to further improve classification performance.

The above is the main research work of this thesis.
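
The two baseline criteria named above can be made concrete with a small sketch. The code below computes the standard textbook forms of information gain and pointwise mutual information from document counts. It is not the improved method this thesis proposes, whose formulas the abstract does not give. The function names and the toy corpus are illustrative assumptions, and real Chinese text would first need word segmentation, which is assumed to have been done already. Both scores look only at whether a term occurs in a document, which is the term-frequency blind spot described above.

```python
import math
from collections import Counter


def term_class_counts(docs, labels):
    """Count documents containing each term, overall and per class."""
    n_docs = len(docs)
    class_docs = Counter(labels)   # class -> number of documents
    df = Counter()                 # term -> number of documents containing it
    df_class = Counter()           # (term, class) -> documents of that class containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):   # presence only: repeated occurrences are ignored
            df[term] += 1
            df_class[(term, label)] += 1
    return n_docs, class_docs, df, df_class


def _entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(term, n_docs, class_docs, df, df_class):
    """IG(t) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t)."""
    n_t = df[term]
    n_not = n_docs - n_t
    h_c = _entropy([n / n_docs for n in class_docs.values()])
    h_t = _entropy([df_class[(term, c)] / n_t for c in class_docs]) if n_t else 0.0
    h_not = (
        _entropy([(class_docs[c] - df_class[(term, c)]) / n_not for c in class_docs])
        if n_not
        else 0.0
    )
    return h_c - (n_t / n_docs) * h_t - (n_not / n_docs) * h_not


def mutual_information(term, cls, n_docs, class_docs, df, df_class):
    """Pointwise MI(t, c) = log2(P(t, c) / (P(t) * P(c))); -inf if t never occurs in c."""
    joint = df_class[(term, cls)]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n_docs / (df[term] * class_docs[cls]))


# Toy, already-segmented corpus (hypothetical data for illustration only).
docs = [
    ["ball", "team", "win"],
    ["team", "match", "ball", "ball"],
    ["stock", "market", "win"],
    ["market", "bank", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]
counts = term_class_counts(docs, labels)

ig_ball = information_gain("ball", *counts)             # 1.0: "ball" separates the classes
ig_win = information_gain("win", *counts)               # 0.0: "win" is spread evenly
mi_ball = mutual_information("ball", "sports", *counts)  # 1.0
```

The second document contains "ball" twice, yet the scores would be identical if it occurred once, since only document presence is counted.
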
The experimental results show that the improved mutual information and improved IG feature selection methods presented in this thesis do make up for the deficiencies of the traditional ones. Moreover, compared with using LDA alone, the method that combines the improved feature selection with LDA achieves better text categorization results.
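
The combination evaluated above, feature selection first and LDA as the text representation second, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the abstract does not say how LDA is fitted, so collapsed Gibbs sampling is used here as one common choice. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the `selected_terms` list (standing in for the output of any feature selector) are all hypothetical.

```python
import random


def lda_gibbs(docs, vocab, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic proportions."""
    rng = random.Random(seed)
    word_id = {w: i for i, w in enumerate(vocab)}
    # Feature selection takes effect here: tokens outside the vocabulary are dropped.
    corpus = [[word_id[w] for w in doc if w in word_id] for doc in docs]
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in corpus]          # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total tokens per topic
    assign = []                                            # topic of every token

    # Random initial topic assignment.
    for d, doc in enumerate(corpus):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions: one low-dimensional vector per document.
    theta = []
    for d, doc in enumerate(corpus):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(n_topics)])
    return theta


# Toy, already-segmented corpus (hypothetical data for illustration only).
docs = [
    ["ball", "team", "win"],
    ["team", "match", "ball", "ball"],
    ["stock", "market", "win"],
    ["market", "bank", "stock"],
]
# Hypothetical output of a feature selector such as MI or IG.
selected_terms = ["ball", "team", "stock", "market"]

theta = lda_gibbs(docs, selected_terms, n_topics=2)
```

Each row of `theta` is a topic-probability vector that could be passed to any classifier in place of the raw term vector. Running LDA on the full vocabulary instead of `selected_terms` would correspond to the LDA-only baseline mentioned above.
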

Keywords/Search Tags:

text classification, feature selection, LDA model, mutual information, information gain

Related items

1	Research And Improvement Of Feature Selection Algorithm In Text Classification
2	Research On Feature Selection Algorithm Of Spam Filtering
3	Improvement On Mutual Information In Feature Selection Based On Composite Ratio
4	On Research For Chinese Automatic Text Categorization Technology Based On VSM Model And Feature Selection
5	Study Of Mutual Information Feature Selection In Chinese Text Classification
6	Analysis And Study On Feature Selection Method In Chinese Text Categorization
7	The Research Of Feature Selection Method In Text Classification Based On Triple-Play
8	Research Of Chinese Text Classification Algorithms Based On VSM
9	Research Of Feature Selection For Text Classification
10	Research On The Algorithm Of Feature Selection Based On Mutual Information For Text Categorization