An Improved Approach To Feature Selection Of Chinese Text Categorization Based On Correlation Grouping Principle

Posted on: 2017-01-02 | Degree: Master | Type: Thesis
Country: China | Candidate: W L Zhu | Full Text: PDF
GTID: 2348330488978135 | Subject: Applied Mathematics
Abstract/Summary:
With the rapid development of science and technology, and of information technology in particular, the Internet has penetrated into people's daily lives, bringing a massive increase in the amount of information. Most of the information people send and receive every day is text, so managing this vast amount of text has become a problem that cannot be avoided. Text classification technology helps people manage large volumes of text efficiently and conveniently, and it has broad application prospects in information retrieval, information filtering, search engines, text databases, digital libraries and other fields. As text categorization technology matures, it is being applied more and more widely. Chinese text classification has become a hot research topic because of the special characteristics of the language.

Generally speaking, text categorization consists of five main steps: text preprocessing, feature selection, feature weighting, classification, and evaluation of the classification results. Each of these steps affects the final classification. Text preprocessing converts the text into a structured form that the computer can read and reduces the dimension of the data. Feature selection is the main step in reducing the data dimension, and it makes the whole data set easier to process. Feature weighting assigns a weight to each feature in the selected feature subset. An appropriate classification algorithm is then chosen to train the classifier. Finally, the performance of the classifier is evaluated and its parameters are adjusted until the desired classification performance is reached. These steps yield a good text classifier that can classify text automatically and manage text information efficiently and effectively.

The main contributions of this thesis are as follows:

(1) The main stages of text categorization are analyzed, and the particularities of Chinese text classification are discussed. For example, Chinese text must be segmented into words during preprocessing, working from phrases or even short sentences, which differs greatly from English text, where words are separated by whitespace. Word segmentation is therefore an important factor affecting the final results of Chinese text classification.

(2) The feature selection step of text categorization is the main object of study. The advantages and disadvantages of the traditional feature selection methods and of the feature selection methods popular in machine learning are analyzed. Because the Laplace score (Laplacian Score) algorithm ignores the correlation between features, it tends to select features that are individually important but overlap with one another. To address this defect, this thesis presents an algorithm that combines feature correlation grouping with the Laplace score.

(3) The feature correlation grouping principle is further improved and generalized to other feature selection algorithms, and even to image classification. Experiments show that most of the feature selection algorithms tested improve significantly when combined with it.

This is the main work of the thesis. Experiments on several data sets show that the proposed feature correlation grouping principle is suitable for most feature selection algorithms.
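The abstract does not give the details of the grouping procedure, so the following is only a minimal sketch of the general idea and not the thesis's own algorithm. It assumes the standard Laplacian Score (k-nearest-neighbour graph with heat-kernel weights, lower score is better) and a hypothetical greedy grouping rule: features are visited from best to worst score, and a feature is skipped if its absolute Pearson correlation with an already selected feature reaches a threshold. The function names, the parameters `k`, `t` and `threshold`, and the toy data are all illustrative assumptions.

```python
import numpy as np

def laplacian_score(X, k=5, t=1.0):
    """Laplacian Score of each column of X (lower = better). Requires k < n."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # k nearest neighbours of each sample, excluding the sample itself.
    masked = dist2.copy()
    np.fill_diagonal(masked, np.inf)
    nbrs = np.argsort(masked, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    # Heat-kernel weights on kNN edges, then symmetrize the graph.
    S = np.zeros((n, n))
    S[rows, cols] = np.exp(-dist2[rows, cols] / t)
    S = np.maximum(S, S.T)
    deg = S.sum(axis=1)
    L = np.diag(deg) - S
    # Remove the degree-weighted mean of every feature.
    Xc = X - (deg @ X) / deg.sum()
    num = np.sum(Xc * (L @ Xc), axis=0)   # f^T L f for each feature
    den = deg @ (Xc ** 2)                 # f^T D f for each feature
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(den > 1e-12, num / den, np.inf)

def select_by_correlation_groups(X, scores, n_select, threshold=0.8):
    """Hypothetical grouping rule: keep one feature per correlated group."""
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = np.abs(np.corrcoef(X, rowvar=False))
    corr = np.nan_to_num(corr)  # constant features count as uncorrelated
    selected = []
    for j in np.argsort(scores):  # best (lowest) score first
        # A feature strongly correlated with a selected one belongs to
        # that feature's group and is skipped.
        if all(corr[j, s] < threshold for s in selected):
            selected.append(int(j))
        if len(selected) == n_select:
            break
    return selected

# Toy data: f1 is a near-duplicate of f0; f2 and f3 are pure noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)
f0 = labels * 4.0 + rng.normal(size=60)
f1 = f0 + 0.01 * rng.normal(size=60)
X = np.column_stack([f0, f1, rng.normal(size=60), rng.normal(size=60)])

scores = laplacian_score(X, k=5, t=10.0)
selected = select_by_correlation_groups(X, scores, n_select=2)
print("scores:", scores)
print("selected feature indices:", selected)
```

In this sketch, ranking by score alone could pick both f0 and f1, which carry almost the same information. The grouping step keeps at most one of them, which is the kind of overlap the abstract says the plain Laplace score tends to produce.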
Keywords/Search Tags:text categorization, feature selection, feature correlation, feature grouping, Laplace Score