Research on Feature Selection in Text Classification

Posted on: 2015-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: S C Zhao
GTID: 2308330461484954
Subject: Systems Engineering
Abstract/Summary:
In text classification, the high dimensionality of the feature space strongly affects classification efficiency, so feature dimension reduction is a crucial step. Dimension reduction methods are now mature, and both feature extraction and feature selection can give satisfactory results. All traditional feature selection methods, however, rest on one premise: the features in the training set and the test set follow the same probability distribution. In our experiments we found that the feature distributions of the training and test sets may differ, and that this difference heavily affects the accuracy of text categorization.

With the rapid growth and application of information technology, text, as an important carrier of information, increasingly undergoes changes in content and classes. Text data therefore often show dynamic characteristics; this thesis calls such data dynamic text data. For a dynamic text data set, the difference in feature distribution between the training and test sets becomes more pronounced and degrades categorization accuracy even further.

To solve this problem, the thesis focuses on finding effective methods to eliminate the difference. First, venture decision is adopted, treating feature selection as a decision problem, which improves the classification results of the algorithm. In addition, the thesis explores ways to reduce or eliminate the difference from the viewpoint of transfer learning. The main work is summarized as follows:

(1) A text feature selection approach based on venture decision is proposed. The development of artificial intelligence and the formation of knowledge bases make it possible to modify decision strategies promptly and automatically on the basis of new information. Following this idea, the venture decision method is applied to dynamic text categorization. It does not take the relevance between features and text categories into account, but directly uses a utility function to evaluate the contribution of each feature word to the classification results. It then chooses the feature words with the largest contribution values as the feature dictionary, thereby reducing the dimension. The validity of the algorithm is verified on Chinese mail and Chinese web page data sets, and its robustness is tested on an English web data set. The experimental results show that the proposed method selects the feature words that have more influence on classification and significantly improves the text classification results.

(2) A text feature selection approach based on transfer learning is presented. Transfer learning is well suited to the problem raised in this thesis. However, transfer learning in machine learning did not gain wide attention until this century. Transfer learning is usually divided into four major categories: instance transfer, feature representation transfer, model transfer, and relational knowledge transfer. There is still no appropriate transfer learning algorithm for text feature selection. This thesis first gives a brief introduction to typical transfer learning algorithms and then presents an improved algorithm, which is also verified by experiments.

The purpose of this thesis is to identify the problem through experiments and then to seek solutions from different angles by analyzing the experimental results in depth. The two proposed feature selection methods, based on venture decision and transfer learning respectively, overcome the limitations of traditional feature selection algorithms well, so good text categorization results can be obtained. The research results are also significant for extending the ways in which SVM can be applied.
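
The utility-based selection step described in item (1) can be illustrated with a minimal sketch. The abstract does not give the thesis's utility function, so the `utility` below is a hypothetical stand-in: it scores a word by the accuracy of a one-word rule that predicts the majority label among documents containing the word and the majority label among documents lacking it. The names `utility` and `select_features` and the toy data are assumptions made for illustration only, not the author's method.

```python
from __future__ import annotations
from collections import Counter


def utility(word: str, docs: list[set[str]], labels: list[str]) -> float:
    """Hypothetical utility (an assumption, not the thesis's definition):
    accuracy of a rule that predicts the majority label separately for
    documents that contain `word` and for documents that do not."""
    with_word: Counter = Counter()
    without_word: Counter = Counter()
    for tokens, label in zip(docs, labels):
        (with_word if word in tokens else without_word)[label] += 1
    correct = 0
    if with_word:
        correct += max(with_word.values())
    if without_word:
        correct += max(without_word.values())
    return correct / len(docs)


def select_features(docs: list[set[str]], labels: list[str], k: int) -> list[str]:
    """Keep the k words with the largest utility as the feature dictionary.
    Ties are broken alphabetically so the result is deterministic."""
    vocab = sorted({w for tokens in docs for w in tokens})
    scored = [(utility(w, docs, labels), w) for w in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:k]]


if __name__ == "__main__":
    # Toy data, invented for this sketch.
    docs = [{"free", "offer"}, {"free", "win"},
            {"meeting", "report"}, {"report", "offer"}]
    labels = ["spam", "spam", "ham", "ham"]
    print(select_features(docs, labels, 2))  # prints ['free', 'report']
```

In this sketch the score is computed from the classification outcome itself rather than from a feature-category relevance statistic, which is the distinction the abstract draws. A real implementation would evaluate the utility on data that reflects the test distribution.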
Keywords/Search Tags: Chinese Text Classification, Dynamic Data Set, Feature Selection, Venture Decision, Transfer Learning