Research On Feature Selection And Multi-label Transformation Of Text Classification

Posted on:2012-06-01

Degree:Master

Type:Thesis

Country:China

Candidate:Y Feng

Full Text:PDF

GTID:2178330332496363

Subject:Computer Science and Technology

Abstract/Summary:

The volume of textual data has grown sharply with the rapid development of information technology, and text classification technology helps save manpower and financial resources. Text classification research has therefore become increasingly important. The purpose of this paper is to improve the performance of text classification. To that end, we study two issues: feature selection and the transformation of multi-label textual data.

First, we summarize five related technologies: segmentation, feature representation, feature extraction, classification algorithms, and performance evaluation. Through design and implementation, we describe the entire text classification process. Building on this work, we choose feature selection and multi-label assignment as the topics for in-depth study.

Second, to study feature selection, we analyze several algorithms and identify their defects through comparison and experiment. By restricting the document distribution, a large number of high-frequency noise words are filtered out. Then, according to the characteristics of four methods, DF, IG, MI and CHI, this paper presents an evaluation strategy that improves each feature selection method by jointly considering document frequency and term frequency. Comparative experiments indicate that the four improved feature selection methods not only reduce high-frequency noise words but also yield better classification results.

Third, multi-label classification is an important and complex issue in the field of text categorization. For this issue, we study methods for modeling and learning multi-label data and then analyze several common multi-label assignments. Based on their drawbacks, this paper presents a new multi-label assignment, which aims to transform multi-label data into single-label data by keeping the useful training samples and balancing the weight of samples in each category. A comparative experiment shows that, compared with five traditional assignments, the improved assignment for multi-label documents is more robust and improves text classification results.

Experiments show that the comprehensive evaluation strategy combining document frequency and term frequency, together with the multi-label assignment presented in this paper, effectively improves the precision of text classification. These findings can be applied to large-scale text classification to improve efficiency.
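
For readers unfamiliar with the feature selection methods named in the abstract, the following is a minimal sketch of two of them in their standard textbook form: document frequency (DF) and the chi-square statistic (CHI). It does not reproduce the improved strategy proposed in the thesis, whose details are not given in the abstract. The function names and the toy corpus are assumptions made purely for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def chi_square(docs, labels, term, category):
    """Standard chi-square score of `term` with respect to `category`.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # term or category is constant over the corpus
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical toy corpus (already segmented), one label per document.
DOCS = [
    ["ball", "goal"],
    ["ball", "team"],
    ["stock", "market"],
    ["stock", "ball"],
]
LABELS = ["sports", "sports", "finance", "finance"]

DF = document_frequency(DOCS)
STOCK_SCORE = chi_square(DOCS, LABELS, "stock", "finance")
GOAL_SCORE = chi_square(DOCS, LABELS, "goal", "sports")
```

In this toy corpus, "stock" appears in every finance document and in no sports document, so it receives a higher chi-square score than "goal", which appears in only one of the two sports documents.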

Keywords/Search Tags:

Data Mining, Text Classification, Feature Selection, Multi-label Document

Related items

1	Research On The Improvement Of Association Classification Algorithm And Feature Selection Of Multi-label Classification
2	Multi-Label Classification By Exploiting Relationship Of Labels
3	Research On The Multi-label Feature Selection And Classification Methods With The Label Correlations
4	Research On Multi-label Text Classification For Imbalanced Data
5	Research And Implementation On Text Classification In Vertical Domain
6	Research And Implementation Of Multi-label Text Classification Based On User Generated Content
7	Based On Decision Relevance Multi-label Classification And Feature Selection Algorithm
8	Research On Text Multi-label Classification Algorithm Based On Label Correlation
9	Feature Selection Method Research For Multi-label Classification
10	Based On The Rapid Large-scale Text Hierarchical Classification Problem Of Centralized