| Offensive language refers to user comments that show offensive tendencies such as insults, hatred, or aggression. As user-generated content on the Internet grows day by day, offensive language grows with it. To support analysis and decision making by platforms and relevant authorities, offensive language identification has become an important research topic. In recent years, offensive language identification has relied mainly on deep learning methods, but no publicly available Chinese offensive language dataset exists, which hinders model training. To address this problem, this thesis uses a transfer learning approach to generate a Chinese offensive language training dataset from the English training dataset of OLID (Offensive Language Identification Dataset), designs a Chinese offensive language identification model, fine-tunes the model on the generated Chinese training dataset, and then applies the model to Chinese offensive language identification. The main work of this thesis is as follows. (1) A back-translation-based training dataset transfer method is investigated: the original English comments in the OLID training dataset are translated with Google Translate to obtain Chinese translated comments, and the Chinese translated comments are back-translated with Google Translate to obtain English back-translated comments; a BERT-based English offensive language identification model predicts the labels of the original and back-translated English comments, and a calculation method for LDOL (Label Distance for Offensive Language) is proposed to determine the labels of the Chinese translated comments, thereby generating the Chinese offensive language training dataset; a method for evaluating the transfer quality of the training dataset is designed, and a transfer quality evaluation experiment is conducted to verify the
effectiveness of the training dataset transfer. The experimental results show that the labels of the generated Chinese offensive language training dataset are consistent with manual labeling results and that the dataset can be used for model training. (2) A Chinese offensive language identification model, XLM-R-COLIMoCo, based on fine-tuning the pre-trained model XLM-R, is designed: a baseline model, XLM-R-COLI, is first built by fine-tuning XLM-R on the Chinese offensive language training dataset generated by the back-translation-based transfer method, and is then used for Chinese offensive language identification; the baseline is improved by applying supervised contrastive learning during fine-tuning, combining the supervised contrastive loss with the cross-entropy loss and using the combined loss as the training objective, which yields XLM-R-COLIMiCo; to improve supervised contrastive learning, momentum contrastive learning is introduced, which yields XLM-R-COLIMoCo; parameter tuning experiments using grid search and cross-validation determine the hyperparameters of XLM-R-COLIMiCo and XLM-R-COLIMoCo. Comparative experiments show that XLM-R-COLIMoCo outperforms the other models on the Chinese offensive language identification task, verifying its effectiveness. (3) A prototype Chinese offensive language identification system is designed and implemented with XLM-R-COLIMoCo at its core, and the usability of the model is verified. The main innovations and contributions of this thesis are as follows. (1) In the training dataset transfer process, a calculation method for the label distance for offensive language (LDOL) is proposed, and a Chinese training data generation algorithm based on back-translation and label distance is proposed to determine the labels of the
generated Chinese offensive language dataset. (2) In the fine-tuning process of the Chinese offensive language identification model, supervised contrastive learning is introduced and then improved with momentum contrastive learning, which improves the model's detection performance for Chinese offensive language. |
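
The combination of a supervised contrastive loss with the cross-entropy loss described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the weighting factor `lam`, the `temperature` value, the convex combination of the two terms, and all function names are hypothetical, and the contrastive term follows the commonly used supervised contrastive formulation rather than the thesis's own definition. Plain Python is used so the arithmetic stays visible; a real fine-tuning run would compute the same quantities on tensors produced by XLM-R.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of logit rows and integer class labels."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[y]
    return total / len(labels)


def supervised_contrastive(embeddings, labels, temperature=0.1):
    """Pull same-label embeddings together and push different-label ones apart.

    Anchors with no same-label partner in the batch are skipped.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarity of anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def combined_loss(logits, embeddings, labels, lam=0.5, temperature=0.1):
    """Assumed training objective: (1 - lam) * cross-entropy + lam * contrastive."""
    ce = cross_entropy(logits, labels)
    scl = supervised_contrastive(embeddings, labels, temperature)
    return (1 - lam) * ce + lam * scl
```

In this sketch, `lam` and `temperature` stand in for the kind of hyperparameters that grid search and cross-validation would select. Momentum contrastive learning, as used in XLM-R-COLIMoCo, would additionally draw comparison samples from a queue of embeddings produced by a slowly updated copy of the encoder; that part is not shown here.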