| With the rapid development of artificial intelligence (AI), text classification technology has been widely adopted in intelligent services across industries, improving the efficiency of text processing. However, such applications face severe challenges in complex data scenarios. Because of intrinsic distribution differences or the difficulty of collecting certain categories, naturally collected datasets often exhibit an imbalanced label distribution. During training, a model is easily dominated by the categories with more samples and performs poorly on rare classes, which limits the practicality of deep-network-based classifiers in real-world applications. Classification of imbalanced data is therefore a practical problem that urgently needs to be solved. This thesis studies text classification with imbalanced data, aiming to build text classification models that perform better under imbalanced data distributions. The main contributions are as follows: 1. To address the issue that the feature space learned on an imbalanced data distribution with the standard cross-entropy loss is itself imbalanced, and that performance degrades rapidly as the imbalance ratio rises, a novel method based on Prototypical Supervised Contrastive (PSC) learning is proposed. To enhance the model's representation learning, the method places a two-branch neural network after the feature extractor. The network consists of a feature learning branch trained with the PSC loss and a classifier learning branch trained with the label-distribution-aware margin loss with a prototype similarity penalty. The proposed method places more emphasis on representing minority concepts, which helps the model maintain a balanced feature space with clear classification boundaries on imbalanced data. 2. To address the issue that existing methods improve minority-class performance at the cost of majority-class performance, this thesis proposes a simple but effective mixture-of-experts
algorithm, called Adaptive Aggregation of Skilled Diverse Experts. Each expert network is trained independently under a loss function with a different mechanism, so that each expert is biased toward, and skilled on, its own set of classes. To find the best combination of the individual experts, an adaptive weight learning module is designed to assign higher weights to the more trustworthy experts. The algorithm effectively improves classification performance on all classes, and it shows stronger generalization ability and more stable performance. 3. To meet the practical needs of content security review, this thesis designs and implements a classification system for harmful Chinese text, with a simple, easy-to-operate interactive interface. The system integrates the algorithms proposed in this thesis and can take on a large proportion of the text review work for reviewers, which gives it practical value. Extensive experiments and analyses of the two proposed methods were conducted on a self-constructed dataset and public benchmark datasets, and the results confirm the novelty and advanced performance of both methods. In addition, the trained models are deployed in the actual application system, where they provide further value. The research in this thesis therefore has both theoretical and practical significance. |
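
The classifier learning branch of the first contribution is trained with a label-distribution-aware margin (LDAM) loss. As a rough illustration only, the following minimal sketch shows the general LDAM idea of giving rarer classes a larger margin, proportional to n_j^(-1/4) for a class with n_j training samples. It is not the implementation used in this thesis: the prototype similarity penalty is omitted, and the function names `ldam_margins` and `ldam_loss`, as well as the parameters `max_margin` and `scale`, are hypothetical choices made for this example.

```python
import math


def ldam_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n_j^(-1/4).

    The margins are rescaled so that the rarest class receives max_margin
    (max_margin is a hypothetical hyperparameter for this sketch).
    """
    raw = [1.0 / (n ** 0.25) for n in class_counts]
    factor = max_margin / max(raw)
    return [r * factor for r in raw]


def ldam_loss(logits, target, margins, scale=1.0):
    """Cross-entropy for one sample after subtracting the margin from the true-class logit.

    `scale` multiplies the adjusted logits before the softmax
    (an assumed temperature-like parameter, not taken from the thesis).
    """
    adjusted = list(logits)
    adjusted[target] -= margins[target]
    scaled = [scale * z for z in adjusted]
    # Numerically stable log-sum-exp.
    top = max(scaled)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in scaled))
    return log_sum_exp - scaled[target]


# Example: three classes with a long-tailed sample count.
counts = [1000, 100, 10]
margins = ldam_margins(counts)
loss_with_margin = ldam_loss([2.0, 1.0, 1.5], target=2, margins=margins)
loss_plain = ldam_loss([2.0, 1.0, 1.5], target=2, margins=[0.0, 0.0, 0.0])
```

In this sketch the rarest class gets the largest margin, so a sample from that class must be classified with more confidence before its loss becomes small, which is one way to push the decision boundary away from minority classes.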