
Research On Feature Selection And Multi-label Transformation Of Text Classification

Posted on: 2012-06-01 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Feng | Full Text: PDF
GTID: 2178330332496363 | Subject: Computer Science and Technology
Abstract/Summary:
The volume of textual data has grown sharply with the rapid development of information technology. Text classification technology saves manpower and financial resources, so research on it has become increasingly important. The purpose of this thesis is to improve the performance of text classification. To this end, we study two problems: feature selection and the transformation of multi-label textual data.

First, we summarize five related technologies: segmentation, feature representation, feature extraction, classification algorithms, and performance evaluation. Through design and implementation, we describe the entire text classification process. Building on this work, we select feature selection and multi-label assignment for in-depth study.

Second, to study feature selection, we analyze several algorithms and identify their defects through comparison and experiment. By restricting the document distribution, a large number of high-frequency noise words are filtered out. Then, according to the characteristics of four methods, document frequency (DF), information gain (IG), mutual information (MI) and the chi-square statistic (CHI), we present an evaluation strategy that improves each of them by jointly considering document frequency and term frequency. Comparative experiments indicate that the four improved feature selection methods not only reduce high-frequency noise words but also yield better classification results.

Third, multi-label classification is an important and complex issue in text categorization. We study methods for modeling and learning multi-label data and analyze several common multi-label assignments. Based on their drawbacks, we present a new multi-label assignment that converts multi-label data to single-label data while keeping the useful training samples and balancing the weight of samples in each category. Comparative experiments show that, compared with five traditional assignments, the improved assignment for multi-label documents is more robust and improves text classification results.

Experiments show that combining the comprehensive evaluation strategy over document frequency and term frequency with the proposed multi-label assignment effectively improves the precision of text classification. These findings can be applied to large-scale text classification to raise efficiency.
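
For orientation, the following is a minimal Python sketch of two textbook baselines of the kind the abstract refers to: document frequency and the standard chi-square (CHI) term score, plus the common "copy-weight" transformation that turns each multi-label document into weighted single-label copies. It is not the thesis's improved strategy or its new assignment, whose details are not given here. The function names and the toy input format (lists of tokens, lists of labels) are assumptions made for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it.

    docs: list of token lists (hypothetical input format).
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return df


def chi_square(docs, labels, term, category):
    """Standard 2x2 chi-square statistic between a term and a category.

    a: docs in the category containing the term
    b: docs outside the category containing the term
    c: docs in the category without the term
    d: docs outside the category without the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate in this sample
    return n * (a * d - c * b) ** 2 / denom


def copy_weight_transform(samples):
    """Baseline multi-label to single-label transformation.

    samples: list of (doc, [labels]). A document with k labels becomes
    k single-label samples, each weighted 1/k. Unlabeled docs are skipped.
    """
    out = []
    for doc, doc_labels in samples:
        if not doc_labels:
            continue
        weight = 1.0 / len(doc_labels)
        for label in doc_labels:
            out.append((doc, label, weight))
    return out


if __name__ == "__main__":
    docs = [["ball", "goal"], ["ball", "team"],
            ["stock", "market"], ["stock", "goal"]]
    labels = ["sport", "sport", "finance", "finance"]
    print(document_frequency(docs))
    print(chi_square(docs, labels, "ball", "sport"))
    print(copy_weight_transform([("d1", ["a", "b"]), ("d2", ["a"])]))
```

In this toy corpus "ball" occurs only in sport documents, so it scores high under CHI, while "goal" is spread evenly across both categories and scores zero. A strategy that also weighs term frequency, as the abstract describes, would need per-document term counts in addition to these document-level counts.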
Keywords/Search Tags: Data Mining, Text Classification, Feature Selection, Multi-label Document