
Research On Feature Selection And Weighting Method For Chinese Text Classification

Posted on: 2014-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: W R Song
GTID: 2268330392973420
Subject: Mathematics
Abstract/Summary:
As a key technology for organizing and processing large amounts of document data, text classification provides an effective way to address the problem of disordered information. It also makes it easier for users to retrieve the information they need. Because text classification has shown great value in information retrieval and filtering, it has become a hot research topic. Feature selection is an important part of text classification; it reduces the dimensionality of the feature space to improve the efficiency and accuracy of the classifier.

Because of problems at levels such as category and data, feature selection methods face many challenges. The imbalanced data problem is widespread in applications of text classification. In an imbalanced dataset, the numbers of positive and negative samples differ greatly. When dealing with such data, most traditional machine learning algorithms, which are designed for balanced datasets, are biased toward the negative category, so classification performance is not ideal. Current studies of this problem focus on two levels: sampling and algorithms.

This paper first gives an overview of text classification and the related process, including preprocessing, feature selection, and commonly used classification algorithms. We then carry out an in-depth study of the imbalanced data problem and put forward a solution: while ensuring the overall classification accuracy of positive samples, we combine a class distinction factor and an average frequency factor to improve the chi-square statistic. A set of comparative experiments shows that our method performs better than traditional feature selection methods on imbalanced data. We also investigate how feature weights are calculated and propose an improved method that combines TF-IDF with the feature selection method. Experiments on imbalanced data show that the proposed method is feasible and effective in improving classification accuracy.
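
For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal sketch of the standard (unimproved) chi-square statistic for a term and a category, together with a plain TF-IDF weight. The abstract does not give the thesis's improved formulas, so the class distinction and average frequency factors are not reproduced here. The function names, the toy corpus, and the choice to multiply TF-IDF by the chi-square score are all illustrative assumptions, not the author's method.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, category):
    """Standard chi-square statistic for one term and one category."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # in category, contains term
        elif has_term:
            b += 1  # outside category, contains term
        elif in_cat:
            c += 1  # in category, lacks term
        else:
            d += 1  # outside category, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def tf_idf(tokens, term, docs):
    """Plain TF-IDF: relative term frequency times log(N / document frequency)."""
    df = sum(1 for doc in docs if term in doc)
    if df == 0 or not tokens:
        return 0.0
    tf = Counter(tokens)[term] / len(tokens)
    return tf * math.log(len(docs) / df)


def combined_weight(tokens, term, docs, labels, category):
    """Assumed combination: scale TF-IDF by the term's chi-square score."""
    return tf_idf(tokens, term, docs) * chi_square(docs, labels, term, category)


# Hypothetical imbalanced toy corpus: one "finance" document, three "sports".
docs = [
    ["stock", "market", "rises"],
    ["team", "wins", "match"],
    ["team", "loses", "match"],
    ["match", "ends", "draw"],
]
labels = ["finance", "sports", "sports", "sports"]

stock_score = chi_square(docs, labels, "stock", "finance")
team_score = chi_square(docs, labels, "team", "finance")
stock_weight = combined_weight(docs[0], "stock", docs, labels, "finance")
```

In this toy corpus, "stock" appears only in the single finance document, so it scores higher for the finance category than "team", which appears in only some of the sports documents.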
Keywords/Search Tags: Text Classification, Feature Selection, Imbalanced Dataset, Feature Weighting