With the Internet now highly developed, a huge amount of information is shared and received every moment. Text is a basic carrier of information, and classifying it efficiently is particularly important given the large volume of cluttered text data. Text classification methods mainly learn latent features from a large amount of labeled text and then classify unlabeled data automatically. However, in real-world scenarios labeled data is often difficult to obtain, while unlabeled data is cheap and abundant. In addition, most current text classification models perform poorly on micro-topic classification tasks, where categories have high semantic similarity. Therefore, to address the strong reliance on high-quality labeled data and the unsatisfactory classification results on micro-topics, this thesis designs and implements a weakly supervised micro-topic text classification method and develops a system around it. This thesis first investigates traditional text classification methods, the application of weakly supervised techniques, and micro-topic classification methods, and then improves these techniques in several respects. Finally, the classification method is built into a well-designed system. Specifically, the method consists of four steps: (1) Fine-grained label construction. The BERT model is used to predict coarse-grained category labels and to generate a fixed-size list of highly relevant words, which serves as the basis for selecting micro-topic labels (fine-grained labels). (2) Initial labeled dataset construction. The weakly supervised set is initialized according to two strategies, "direct inclusion of label names" and "relevant word list coverage"; this set is the source of the training set needed for the final classification model. (3) Labeled dataset expansion. The GPT model is used to generate more micro-topic pseudo-labeled instances. By improving the loss function of the GPT
model, the hierarchical coarse-to-fine label information is exploited, which improves the quality of the generated micro-topic pseudo-labeled dataset. (4) Iterative classifier training. The pseudo-labeled dataset is used to train the classifier. The weakly supervised dataset is updated iteratively to improve its quality and thus the quality of the text classifier. The effectiveness of the designed method is verified through several experiments. Comparison experiments against other weakly supervised classification methods show that the method designed in this thesis achieves better classification results. The final F1-score can reach 86.9%, which is close to that of supervised methods. The ablation experiments show that the improvement made at each step has a positive effect on classifier training. Finally, a text classification system is designed and implemented, which provides a source of data for downstream natural language processing applications.
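
To make step (2) more concrete, the two initialization strategies could be sketched roughly as follows. This is only an illustrative sketch, not the implementation used in the thesis: the function names `tokenize` and `weak_label`, the `min_hits` threshold, the simple regex tokenizer, and the tie-breaking rule are all assumptions introduced here for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weak_label(text, label_words, min_hits=2):
    """Return a pseudo-label for `text`, or None if neither strategy fires.

    `label_words` maps each (non-empty) micro-topic label name to its
    relevant word list. `min_hits` is a hypothetical coverage threshold.
    """
    tokens = set(tokenize(text))

    # Strategy 1: direct inclusion of the label name.
    # For multi-word labels this only checks that every word of the name
    # occurs somewhere in the text, not that the words are adjacent.
    for label in label_words:
        if set(tokenize(label)) <= tokens:
            return label

    # Strategy 2: relevant word list coverage.
    # Pick the label whose word list overlaps most with the text;
    # on a tie, the label that appears first in `label_words` wins.
    best_label, best_hits = None, 0
    for label, words in label_words.items():
        hits = len(tokens & {w.lower() for w in words})
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label if best_hits >= min_hits else None
```

Texts for which the function returns `None` would stay unlabeled at this stage and could be revisited in the later expansion and iterative training steps.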