Research On Imbalanced Sentiment Classification Based On Data Augmentation And Graph Convolutional Network

Posted on: 2023-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Pang
Full Text: PDF
GTID: 2558307070983349
Subject: Signal and Information Processing
Abstract/Summary:
With the development of Internet platforms such as Weibo and Taobao, people are increasingly accustomed to expressing their opinions and attitudes online, and massive amounts of text data are generated every moment. Identifying the sentiment tendencies of these data and fully mining the commercial and social value they contain is a challenge. Thanks to advances in deep learning for sentiment analysis, a neural network can be trained on labeled texts to predict the sentiment tendencies of unlabeled texts in various fields. In real scenarios, however, the class distribution of a sentiment dataset is usually skewed: one class contains far more samples than the other, which biases the model toward the larger class during training. Based on this observation, this thesis studies imbalanced binary sentiment classification at both the data preprocessing level and the algorithm level. The main work is as follows:

(1) Class weighting (CW) is introduced to design a term weighting scheme, TF-IGM-CW, for imbalanced sentiment datasets. The weighting separates the classes in vector space and can be used to extract the feature words of each class. On this basis, the concept of a sentiment centroid is proposed to identify representative data and noisy data in vector space; these serve as the data to be rewritten at the data preprocessing level.

(2) At the data preprocessing level, a two-stage balancing strategy (TSBS) based on data augmentation is designed, consisting of an over-sampling stage and a noise revision stage. In the over-sampling stage, new samples are generated by replacing the feature words of the representative data, which balances the class distribution. In the noise revision stage, the sentiment tendency of the noisy data is corrected, again by replacing feature words. The result is a sentiment dataset with a balanced class distribution and clear class boundaries.

(3) At the algorithm level, this thesis selects the graph convolutional network, which generalizes well, as the classification model. By introducing class nodes and TF-IGM-CW into the heterogeneous text graph of Text-GCN, a novel model called Senti-GCN is proposed for text sentiment classification, improving performance on imbalanced sentiment datasets at the algorithm level. The thesis also studies imbalanced sentiment classification performance when TSBS is combined with Senti-GCN.

(4) The effectiveness of the proposed approaches is verified on four public sentiment datasets. Compared with the original imbalanced datasets, TSBS improves the classification accuracy of the models by 2.97% on average: the over-sampling stage contributes 2.11% on average, and the noise revision stage adds a further 0.86%. Senti-GCN also performs well on imbalanced datasets, improving classification accuracy over Text-GCN by about 0.73% on average. Combining it with TSBS yields a further 2.17% improvement on average.
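
To make the term weighting in item (1) more concrete, the following is a minimal sketch only. The abstract does not give the TF-IGM-CW formula, so the code uses the previously published TF-IGM scheme (weight = tf * (1 + lambda * igm), where igm is the largest class-specific frequency divided by the rank-weighted sum of all class-specific frequencies). As an assumed stand-in for class weighting, it divides each class's document frequency by the class size. The function names, the `class_weighted` flag, and the default `lam=7.0` are illustrative choices, not taken from the thesis.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def igm(class_freqs: Sequence[float]) -> float:
    """Inverse gravity moment: f_1 / sum(f_r * r), with f sorted descending."""
    ranked = sorted(class_freqs, reverse=True)
    moment = sum(f * r for r, f in enumerate(ranked, start=1))
    return ranked[0] / moment if moment > 0 else 0.0


def term_igm(
    docs: List[List[str]],
    labels: List[Hashable],
    class_weighted: bool = True,
) -> Dict[str, float]:
    """IGM per term from class-specific document frequencies.

    With class_weighted=True each frequency is divided by its class size.
    This normalisation is an ASSUMPTION standing in for the thesis's class
    weighting, whose exact form the abstract does not state.
    """
    class_sizes = Counter(labels)
    doc_freq: Dict[str, Counter] = {}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each document once per term
            doc_freq.setdefault(term, Counter())[label] += 1

    result = {}
    for term, counts in doc_freq.items():
        freqs = [
            counts[c] / size if class_weighted else float(counts[c])
            for c, size in class_sizes.items()
        ]
        result[term] = igm(freqs)
    return result


def tf_igm_weights(
    tokens: List[str], igm_by_term: Dict[str, float], lam: float = 7.0
) -> Dict[str, float]:
    """TF-IGM weight of each term in one document: tf * (1 + lam * igm)."""
    tf = Counter(tokens)
    return {t: n * (1.0 + lam * igm_by_term.get(t, 0.0)) for t, n in tf.items()}


if __name__ == "__main__":
    # Toy imbalanced corpus: four positive documents, one negative.
    docs = [["the", "good"], ["the", "great"], ["the", "good"],
            ["the", "nice"], ["the", "bad"]]
    labels = ["pos", "pos", "pos", "pos", "neg"]
    print(term_igm(docs, labels, class_weighted=False)["the"])
    print(term_igm(docs, labels, class_weighted=True)["the"])
```

In this toy corpus the word "the" occurs in every document. With raw counts its IGM is 2/3, which makes it look concentrated in the majority class. After the assumed per-class normalisation its IGM falls to 1/3, the minimum possible for two classes, while a word that occurs only in the minority class keeps an IGM of 1.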
Keywords/Search Tags: Imbalanced Sentiment Classification, Term Weighting, Data Augmentation, Graph Convolutional Network