With the rapid development of mobile Internet and big data technologies, short texts, as a lightweight text form that is easy to disseminate and understand, are gradually becoming an important vehicle for people's communication and participation on the Web. Mining valuable information from these data and classifying them is an urgent problem in natural language processing. However, short texts are not only high-dimensional, sparse, large in volume, and rapidly updated, but also often suffer from class imbalance, which interferes with classification. In addition, the challenge of short text classification is further exacerbated by insufficient training samples (the few-shot problem). This thesis addresses imbalanced short text classification in the few-shot setting. First, we investigate the use of semi-supervised clustering to assist classification and improve imbalanced short text classification through resampling. Second, we investigate a combination of co-training and Boosting, which mitigates the shortage of training samples through pseudo-labeled samples and thereby further improves short text classification performance. Finally, a prototype system for short text classification is developed based on the above research. The main contributions of this thesis are as follows:

(1) A short text resampling algorithm based on semi-supervised hierarchical clustering. Cluster analysis is widely used as a preprocessing step that discovers the data distribution before resampling. In this thesis, a semi-supervised hierarchical clustering algorithm is proposed that effectively exposes the characteristics of the data distribution and reveals both inter-class and intra-class imbalance. Based on this clustering algorithm, a semi-supervised short text resampling algorithm is designed that uses the clustering results to guide hybrid resampling: labeled majority-class data are undersampled based on their distance to the cluster center and their proximity to the cluster centers of the minority class, and unlabeled minority-class data are oversampled based on confidence level. Compared with existing short text oversampling methods, this method can exploit large-scale unlabeled data and helps to discover more details of the minority class distribution.

(2) An imbalanced short text classification algorithm combining co-training and Boosting. In imbalanced short text classification, resampling is an effective way to alleviate imbalance between classes, but it struggles to avoid overfitting caused by oversampling and information loss caused by undersampling. In this thesis, we propose an imbalanced short text classification algorithm combining co-training and Boosting, which alleviates the few-shot problem through pseudo-labeled samples and further uses them to modify the class distribution and reduce class imbalance. The method selects pseudo-labeled samples from the unlabeled samples predicted to belong to the minority class, which reduces the difference in sample size between classes. It is further combined with Boosting, which adaptively weights the training data, to improve co-training on imbalanced datasets. This allows traditional classifiers to adapt better to imbalanced data distributions.

(3) Design and implementation of an imbalanced short text classification system. The system is designed and implemented in PyCharm. It is divided into modules for preprocessing and feature extraction, class balancing, model training, and classification. The prototype system has a simple interface and good interactivity, and it verifies the feasibility of the proposed methods and the practicality of the system design.
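
As an illustration of the pseudo-label selection step in contribution (2), the following is a minimal sketch and not the thesis implementation. It assumes a classifier has already produced class probabilities for the unlabeled short texts, and it keeps only those samples predicted to belong to the minority class with high confidence. The function name `select_minority_pseudo_labels`, the `threshold` and `max_new` parameters, and the example probabilities are all hypothetical.

```python
from typing import List, Sequence, Tuple


def select_minority_pseudo_labels(
    probs: Sequence[Sequence[float]],
    minority_class: int,
    threshold: float = 0.8,
    max_new: int = 10,
) -> List[Tuple[int, float]]:
    """Pick unlabeled samples to pseudo-label as the minority class.

    probs[i][c] is the (assumed) predicted probability that unlabeled
    sample i belongs to class c. Returns (index, confidence) pairs,
    most confident first, limited to max_new entries.
    """
    candidates = []
    for i, p in enumerate(probs):
        # Predicted class is the one with the highest probability.
        predicted = max(range(len(p)), key=p.__getitem__)
        # Keep only confident minority-class predictions.
        if predicted == minority_class and p[predicted] >= threshold:
            candidates.append((i, p[predicted]))
    # Most confident candidates are added to the training set first.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:max_new]


# Hypothetical probabilities for five unlabeled short texts
# (column 0 = majority class, column 1 = minority class).
probs = [
    [0.95, 0.05],
    [0.10, 0.90],
    [0.45, 0.55],
    [0.02, 0.98],
    [0.15, 0.85],
]
selected = select_minority_pseudo_labels(
    probs, minority_class=1, threshold=0.8, max_new=2
)
print(selected)  # [(3, 0.98), (1, 0.9)]
```

Because only minority-class predictions are accepted, each round of such a selection would add samples to the smaller class only, which is how pseudo-labeling can narrow the gap in sample size between classes. In a co-training setting, one would expect each view's classifier to supply such candidates to the other, with Boosting-style sample weights applied during retraining; those parts are omitted here.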