
Research on Sentiment Classification Based on Imbalanced Data

Posted on: 2013-09-21
Degree: Master
Type: Thesis
Country: China
Candidate: Z Q Wang
Full Text: PDF
GTID: 2248330371993565
Subject: Computer application technology
Abstract/Summary:
In recent years, people have increasingly expressed their opinions and sentiments on the internet. To analyze this subjective information automatically, sentiment analysis has been proposed and has received a great deal of attention in the natural language processing community. Sentiment classification is a basic task in sentiment analysis and has undergone significant development. However, most existing studies assume that the numbers of negative and positive samples are balanced, which may not hold in practice. Because class imbalance degrades the performance of traditional machine learning approaches, it is a serious problem that urgently needs a solution. This thesis conducts an extensive study of imbalanced sentiment classification. Its main contributions are summarized as follows.

First, this thesis proposes a novel sample-ensemble method for imbalanced sentiment classification. Under-sampling is generally an effective strategy for dealing with class imbalance, but its major drawback is that it discards many potentially useful majority-class samples. To make full use of all majority-class samples, we propose a sample-ensemble learning approach that combines multiple member classifiers, each trained on a different under-sampling of the majority class. We also propose a second ensemble strategy that uses different classification algorithms to increase the diversity of the member classifiers and thereby improve classification performance.

Second, this thesis proposes a novel method for imbalanced sentiment classification based on centroid-directed vectors, which addresses the feature imbalance problem. In sentiment classification, the feature space used to represent text has very high dimensionality. Accordingly, in imbalanced sentiment classification, the majority-class samples normally contain many more distinct features than the minority-class samples. Imbalanced sentiment classification therefore suffers not only from an imbalanced class distribution but also from an imbalanced feature distribution. We propose a clustering-based stratified under-sampling framework and a centroid-directed smoothing strategy to address the imbalanced class and feature distribution problems, respectively.

Third, we propose a semi-supervised learning approach based on the co-training algorithm with dynamic random feature subspace generation to address the shortage of labeled data in imbalanced sentiment classification. This approach can make full use of all labeled samples and also increases the diversity of the member classifiers, so that exploiting unlabeled data yields a substantial improvement in classification performance.
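
The sample-ensemble idea from the first contribution can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the nearest-centroid member classifier, the majority-vote combination, the independent random draws, the toy data, and all function names are assumptions chosen only to keep the example self-contained. The thesis may draw its under-sampled subsets, train its members, and combine them differently.

```python
import random
from collections import Counter

def train_centroid(samples):
    """Stand-in member classifier (an assumption): one mean vector per class."""
    sums, counts = {}, Counter()
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_centroid(model, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: dist(model[y]))

def train_sample_ensemble(majority, minority, n_members=5, seed=0):
    """Train one member per balanced subset: all minority samples plus an
    equally sized random draw (without replacement) from the majority class."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        subset = rng.sample(majority, len(minority)) + minority
        members.append(train_centroid(subset))
    return members

def predict_ensemble(members, x):
    """Combine the member predictions by majority vote."""
    votes = Counter(predict_centroid(m, x) for m in members)
    return votes.most_common(1)[0][0]

# Toy data (made up): 12 majority-class "pos" vectors, 3 minority-class "neg" vectors.
majority = [([1.0 + 0.1 * i, 1.0], "pos") for i in range(12)]
minority = [([-1.0, -1.0 - 0.1 * i], "neg") for i in range(3)]
members = train_sample_ensemble(majority, minority)
```

Calling `predict_ensemble(members, x)` on a new feature vector returns the label chosen by most members. Because every member sees a different slice of the majority class, the ensemble as a whole uses more majority-class samples than any single under-sampled classifier would.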
Keywords/Search Tags:Sentiment Classification, Imbalanced Classification, Feature Imbalanced, Semi-supervised Learning, Ensemble Learning