The Research And Implementation Of Chinese Text Classification Based On Feature Selection And LDA

Posted on: 2015-01-07; Degree: Master; Type: Thesis
Country: China; Candidate: L L Dong; Full Text: PDF
GTID: 2268330428968665; Subject: Computer technology
Abstract/Summary:
Science and technology are advancing rapidly, informatization is accelerating, and the Internet is reaching more areas of the world. As a result, people have more and more channels through which to obtain, spread, and share information. At the same time, they face a major challenge: the "information explosion". How to find the needed information efficiently, accurately, and conveniently has therefore become a pressing concern, and text classification emerged in response. As an important form of data analysis, text classification is an effective way to organize and manage data, and it has been widely applied in search engines, digital libraries, e-government, spam filtering, and other fields.

As an effective means of text processing, text classification includes preprocessing, feature selection, text representation, classifier selection, classifier training, classifier testing, evaluation of the classification results, and other steps. Put simply, its ultimate aim is to predict the category labels of test texts. Every step in the pipeline can affect the final classification performance. Preprocessing performs a preliminary dimension reduction to lower redundancy and prepares the data for the classifier. Feature selection, the core of dimension reduction for text, removes noisy features. Text representation converts unstructured text into a structured form that the computer can process efficiently. The classifier predicts category labels: through training it learns a classification function that maps a text to a label, and the trained classifier can then be tested on new data.
The evaluation of classification results is used to assess the entire classification system comprehensively and objectively.

This paper focuses on feature selection and text representation. To remedy the deficiencies of traditional feature selection methods, it makes several improvements to them and proposes combining them with the LDA model to address the shortcomings that arise when LDA is used alone, thereby further improving classification performance.

Firstly, this paper proposes three indicators, the relative term frequency, the dispersion, and the maximum absolute value, to solve the problems caused by the traditional mutual information feature selection method ignoring term frequency. With these three factors, the shortcomings of the traditional mutual information method are compensated for.

Secondly, this paper proposes an indicator called "max proportion of term frequency" to address the problem that the classification performance of traditional information gain (IG) drops considerably on imbalanced datasets. The result is an improved IG feature selection method that performs well whether or not the dataset is balanced.

Finally, this paper proposes combining feature selection with LDA, a topic model, to address the problem that classification accuracy is not high enough when LDA is used alone. As a topic model, LDA can represent texts by their topic probabilities and, like feature selection methods, can also reduce dimensionality. Therefore, this paper uses LDA as a text representation method and applies feature selection beforehand to further improve classification performance.

The above is the main research work of this paper.
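To make the starting point concrete, the following is a minimal sketch of the traditional, textbook forms of mutual information and information gain, computed from document counts. It is not the improved method proposed in this paper: the abstract does not give formulas for the relative term frequency, the dispersion, the maximum absolute value, or the max proportion of term frequency, so none of them is reproduced here. The function names and the toy documents are assumptions made for illustration only, and the texts are assumed to be already segmented into words.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Textbook IG: H(class) minus the entropy left after splitting the
    documents by whether they contain the term."""
    with_term, without_term = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        if term in tokens:          # presence only, frequency is ignored
            with_term[label] += 1
        else:
            without_term[label] += 1
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy(list(Counter(labels).values()))
    h_cond = (n_with / n) * entropy(list(with_term.values())) \
        + (n_without / n) * entropy(list(without_term.values()))
    return h_class - h_cond


def mutual_information(docs, labels, term, category):
    """Textbook pointwise MI: log2( P(term, category) / (P(term) P(category)) ),
    estimated from document counts."""
    n = len(docs)
    both = sum(1 for d, l in zip(docs, labels) if term in d and l == category)
    n_term = sum(1 for d in docs if term in d)
    n_cat = sum(1 for l in labels if l == category)
    if both == 0:
        return float("-inf")        # term never occurs in this category
    return math.log2(both * n / (n_term * n_cat))


# Toy, pre-segmented documents (hypothetical data).
docs = [["goal", "ball"], ["goal"], ["stock"], ["stock", "bond"]]
labels = ["sport", "sport", "finance", "finance"]

# Same documents, but "goal" is repeated five times in the first one.
docs_repeated = [["goal"] * 5 + ["ball"], ["goal"], ["stock"], ["stock", "bond"]]

ig_once = information_gain(docs, labels, "goal")
ig_repeated = information_gain(docs_repeated, labels, "goal")
mi_goal_sport = mutual_information(docs, labels, "goal", "sport")
```

Because both scores only test whether a term is present in a document, `ig_once` and `ig_repeated` are identical even though "goal" occurs five times in one document. This insensitivity to term frequency is the kind of gap the indicators described above are meant to close.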
The experimental results show that the improved mutual information and improved IG feature selection methods presented in this paper do make up for the deficiencies of the traditional ones. Moreover, compared with using LDA alone, the method that combines the improved feature selection with LDA achieves better results in text categorization.
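The combined approach, feature selection first and LDA as the text representation afterwards, can be sketched roughly as follows. This is an illustrative outline under several assumptions, not the implementation evaluated in this paper: the scoring function `doc_freq_score` is a simple stand-in for the improved MI and IG methods, LDA is fitted with a small collapsed Gibbs sampler (the abstract does not say which inference method was used), the hyperparameters `alpha`, `beta`, and `n_iter` are arbitrary, and the toy documents are assumed to be already segmented into words.

```python
import random
from collections import Counter


def doc_freq_score(docs, labels, term):
    """Stand-in score (an assumption): the largest gap between a term's
    document frequency inside one class and its overall document frequency."""
    overall = sum(term in d for d in docs) / len(docs)
    best = 0.0
    for c in set(labels):
        in_class = [d for d, l in zip(docs, labels) if l == c]
        rate = sum(term in d for d in in_class) / len(in_class)
        best = max(best, abs(rate - overall))
    return best


def select_terms(docs, labels, k, score=doc_freq_score):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: (-score(docs, labels, t), t))
    return set(ranked[:k])


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. Returns one topic distribution
    (a list of n_topics probabilities) per document."""
    rng = random.Random(seed)
    vocab_size = len({t for d in docs for t in d})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_term = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    for i, d in enumerate(docs):            # random initial topic per token
        zs = []
        for t in d:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[i][z] += 1
            topic_term[z][t] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for i, d in enumerate(docs):
            for j, t in enumerate(d):
                z = assign[i][j]            # remove the token's current topic
                doc_topic[i][z] -= 1
                topic_term[z][t] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[i][k] + alpha)
                    * (topic_term[k][t] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[i][j] = z            # record the resampled topic
                doc_topic[i][z] += 1
                topic_term[z][t] += 1
                topic_total[z] += 1
    return [
        [(n + alpha) / (len(d) + n_topics * alpha) for n in doc_topic[i]]
        for i, d in enumerate(docs)
    ]


# Toy, pre-segmented documents (hypothetical data).
docs = [
    ["ball", "goal", "the", "match"],
    ["goal", "team", "the"],
    ["stock", "bond", "the", "market"],
    ["market", "stock", "the"],
]
labels = ["sport", "sport", "finance", "finance"]

keep = select_terms(docs, labels, k=4)                     # step 1: feature selection
filtered = [[t for t in d if t in keep] for d in docs]     # drop unselected terms
theta = lda_gibbs(filtered, n_topics=2)                    # step 2: LDA representation
```

Each row of `theta` is a document's topic distribution and could then be passed to any classifier as its feature vector. In this toy example the uninformative term "the" is removed before LDA ever sees it, which is the intended effect of placing feature selection ahead of the topic model.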
Keywords/Search Tags: text classification, feature selection, LDA model, mutual information, information gain