
Research On The Term Weighting Scheme And Text Representation Strategy For Text Categorization

Posted on: 2017-03-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L J Gu
Full Text: PDF
GTID: 1108330485456889
Subject: Intelligent Environment Analysis and Planning
Abstract/Summary:
Data has penetrated every industry and has become an important factor of production. With the arrival of the big data era, the demand for text information processing technology grows by the day. Manual management can no longer meet society's needs, so automatic text categorization has become increasingly important and is now an active research topic.

This thesis analyzes and summarizes the text classification framework: text representation, text preprocessing, feature selection, feature extraction, term weighting, text classifiers and classification performance evaluation. It then proposes novel methods for term weighting and for text representation strategy. Two term weighting algorithms are proposed for balanced datasets and one for imbalanced datasets; all three are supervised term weighting algorithms. In addition, building on supervised term weighting, the thesis presents an optimal text representation strategy. The main contributions are as follows:

1. The term weighting scheme based on category information

As most classifiers use the Vector Space Model, term weighting has become the bottleneck of categorization, and the results of term weighting directly affect categorization performance. Based on an analysis of traditional term weighting algorithms, the author proposes a new term weighting algorithm. By replacing word features with category-based features, the dimensionality of the document feature space is reduced from tens of thousands of terms to a small number of category-based features, and the representation matrix is no longer sparse. The proposed scheme not only improves text classification accuracy but also increases classification speed, effectively reducing classification time.

2. The term weighting scheme based on class space density

Based on an analysis of inverse category frequency (icf) in traditional term weighting algorithms, class space density is introduced, and from it the inverse class space density frequency (ICSDF) is incorporated into the term weighting algorithm. When measuring the discriminating power of terms, the proposed scheme can assign different weights to terms that have the same category frequency but different document frequencies. It reflects the importance of a feature to classification more objectively and effectively improves the spatial distribution of samples, making samples in the same class more compact and samples in different classes more dispersed. Replacing icf in tf*icf and icf-based with the proposed ICSDF yields two new term weighting schemes: tf*ICSDF and ICSDF-based. The results show that the proposed term weighting schemes achieve better text categorization performance.

3. The term weighting scheme for imbalanced datasets

When commonly used term weighting algorithms are applied to an imbalanced dataset, the expected effect often cannot be achieved, mainly because of the particular data distribution of imbalanced datasets. Based on an analysis of this distribution, we propose a new term weighting scheme for imbalanced datasets. The scheme combines the probability of a term in the positive category with its probability in the negative category to measure the importance of each term for text classification on imbalanced datasets, and assigns a weight according to that importance. In the experiments, the proposed term weighting scheme (tf*WID) and four commonly used term weighting schemes (tf*idf, tf*ig, tf*chi2 and tf*or) are applied to two imbalanced datasets (WebKB and Yahoo! Answers (100-1000)). Rocchio and SVM are used as classifiers, and the comparison and analysis are made on two measures (MicroF1 and MacroF1). The results show that the proposed term weighting scheme effectively improves classification performance on imbalanced datasets.

4. The optimal document representation for supervised term weighting schemes

Based on an analysis of traditional text representation strategies (global policy and local policy), this thesis proposes a new optimal document representation strategy for supervised term weighting schemes based on the Vector Space Model. The proposed scheme is based on the idea of finding the optimal model on the training set: it obtains an optimal term weighting vector from the term weighting vectors produced by the individual categories. Applying this optimal term weighting vector to the test set then gives the optimal text representation for the test set. The proposed scheme is validated on two datasets (the balanced dataset 20 Newsgroups and the imbalanced dataset Reuters-21578). In the experiments, two commonly used supervised term weighting schemes (tf*or and tf*rf) are used to weight the term matrices of the two datasets. The proposed scheme finds the best document representation on the training set, which is then applied to the test set. SVM is used as the classifier. The results show that the proposed optimal document representation strategy for supervised term weighting schemes effectively improves classification performance.
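
The second contribution starts from the observation that inverse category frequency cannot separate terms that share a category frequency but differ in document frequency. The Python sketch below illustrates only that limitation. It assumes the common definition icf(t) = log(|C| / cf(t)), which may differ from the variant used in the thesis, and it does not implement ICSDF, whose formula is not given in this abstract. The toy corpus and the function name `category_stats` are hypothetical.

```python
import math
from collections import defaultdict


def category_stats(docs, labels):
    """Return (icf, df) dictionaries keyed by term.

    icf uses the assumed definition log(|C| / cf), where cf is the number
    of categories in which the term occurs; df is the number of documents
    in which the term occurs.
    """
    term_categories = defaultdict(set)
    term_doc_count = defaultdict(int)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            term_categories[term].add(label)
            term_doc_count[term] += 1
    n_categories = len(set(labels))
    icf = {
        term: math.log(n_categories / len(cats))
        for term, cats in term_categories.items()
    }
    return icf, dict(term_doc_count)


# Hypothetical toy corpus: three categories, six documents.
docs = [
    ["goal", "match", "team"],
    ["goal", "team"],
    ["goal", "league"],
    ["stock", "market", "league"],
    ["stock", "bank"],
    ["vote", "bank"],
]
labels = ["sport", "sport", "sport", "finance", "finance", "politics"]

icf, df = category_stats(docs, labels)

# "goal" and "match" both occur in one category only, so icf treats them
# identically even though "goal" appears in three documents and "match"
# in one.
for term in ("goal", "match"):
    print(term, "df =", df[term], "icf =", round(icf[term], 4))
```

In this toy data the two terms receive identical icf values despite different document frequencies, which is the gap that a density-aware weight such as the proposed ICSDF is described as addressing.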
Keywords/Search Tags: Machine Learning, Text Categorization, Term Weighting, Term Dimension Reduction, Class Space Density, Text Representation Strategy