Research And Implementation Of Feature Selection In Chinese Text Classification

Posted on: 2015-08-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Lin
Full Text: PDF
GTID: 2308330464468741
Subject: Computer technology
Abstract/Summary:
With the swift development of information technology and the rapid popularization of the Internet, the amount of information stored in computer systems is increasing at an astonishing rate. Electronic text is one of the most important forms of data in computer systems, and its growth is as astonishing as the growth of information on the Internet. This large body of text contains a great deal of information that is valuable to people, so a scientific and effective method is needed to extract that information quickly. Automatic text classification based on machine learning is one such method: it goes a long way toward solving the problem of information disorder and helps people classify large amounts of text more efficiently. The study of automatic text classification therefore has considerable practical significance. The core step of automatic text classification is feature selection. An efficient text classifier requires that the features which compose the feature vector space carry strong classification information, and that the vector space strike a good balance among the information in the various categories of text. This paper analyzes the strengths and weaknesses of the traditional feature selection methods, including document frequency (DF), information gain (IG), mutual information (MI), the chi-square statistic (CHI), and expected cross entropy (ECE), and finds that each traditional method measures the importance of a feature word from only one aspect and lacks a comprehensive measure of feature words. This paper presents a feature selection method based on feature importance, which comprehensively considers word frequency, document frequency, uniformity within a class, and the discrimination of a feature word across global categories.
In the new feature selection method, the feature word discrimination of global category is defined from the difference between the two types of information in mutual information, and a sample mean variance factor is introduced to correct the tendency of the mutual information method to favor low-frequency words. This paper also designs and implements a Chinese text categorization system to verify the effectiveness of the new feature selection method. The system follows a modular design and uses the KNN and Bayesian classification algorithms. Comparison experiments between the traditional feature selection methods MI, DF, and CHI and the proposed method show that the new method can effectively reduce the dimension of the feature space and extract features, obtaining good results and reflecting the degree of difference between categories.
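The bias of mutual information toward low-frequency words, which the abstract mentions, can be illustrated with a minimal sketch. It uses the common textbook definitions of DF, MI, and CHI computed from a term/class contingency table; these are assumptions for illustration, not the thesis's own formulas, and the function name `term_scores` and the counts `a`, `b`, `c`, `d` are hypothetical.

```python
import math

def term_scores(a, b, c, d):
    """Score one term against one class from document counts.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    Returns (df, mi, chi) under the usual textbook definitions.
    """
    n = a + b + c + d
    # Document frequency: number of documents containing the term.
    df = a + b
    # Pointwise mutual information: log( P(t, c) / (P(t) * P(c)) ).
    mi = math.log(a * n / ((a + b) * (a + c))) if a > 0 else float("-inf")
    # Chi-square statistic of the 2x2 term/class table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return df, mi, chi

# A frequent term concentrated in the class versus a term seen only once.
common = term_scores(40, 10, 10, 40)
rare = term_scores(1, 0, 49, 50)
print(common)
print(rare)
```

With these made-up counts, the term that appears in a single document receives a higher MI score than the frequent, well-concentrated term, while CHI ranks them the other way around. This is the kind of single-aspect behavior that motivates combining several factors in one importance measure.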
Keywords/Search Tags: Text Classification, Feature Dimensionality Reduction, Feature Selection