Chinese Offensive Language Identification Based On Transfer Learning

Posted on: 2023-04-16

Degree: Master

Type: Thesis

Country: China

Candidate: J Xu

Full Text: PDF

GTID: 2568307061950739

Subject: Electronic and communication engineering

Abstract/Summary:

Offensive language refers to user comments that carry offensive tendencies such as insults, hatred, or aggression. As user-generated content on the Internet grows day by day, offensive language grows with it. To support analysis and decision-making by platforms and relevant authorities, offensive language identification has become an important research topic. In recent years, offensive language identification has relied mainly on deep learning methods, but no publicly available Chinese offensive language dataset exists, which hinders model training. To solve this problem, this thesis uses a transfer learning approach to generate a Chinese offensive language training dataset from the English training dataset of OLID (Offensive Language Identification Dataset), designs a Chinese offensive language identification model, fine-tunes the model on the generated Chinese training dataset, and then uses the model for Chinese offensive language identification.

The main work of this thesis is as follows.

(1) A back-translation-based training dataset transfer method is investigated. The original English comments in the OLID English offensive language training dataset are translated with Google Translator to obtain Chinese translated comments, and the Chinese translated comments are back-translated with Google Translator to obtain English back-translated comments. A BERT-based English offensive language identification model predicts the labels of both the original and the back-translated English comments, and a calculation method for LDOL (Label Distance for Offensive Language) is proposed to determine the labels of the Chinese translated comments, thereby generating the Chinese offensive language training dataset. An evaluation method for the transfer quality of the training dataset is designed, and a transfer quality evaluation experiment is conducted to verify the effectiveness of the transfer. The experimental results show that the labels of the generated Chinese offensive language training dataset are consistent with manual labeling results and that the dataset can be used for model training.

(2) A Chinese offensive language identification model, XLM-R-COLI_{Mo Co}, based on the pre-trained model XLM-R and fine-tuning, is designed. First, a baseline model XLM-R-COLI based on the pre-trained model XLM-R and fine-tuning is designed; it is fine-tuned on the Chinese offensive language training dataset generated by the back-translation-based transfer method and then used for Chinese offensive language identification. Second, the baseline is improved by introducing supervised contrastive learning into the fine-tuning process: the supervised contrastive learning loss function is combined with the cross-entropy loss function, and the combined loss is used as the training objective, yielding XLM-R-COLI_{Mi Co}. Third, to improve supervised contrastive learning, momentum contrastive learning is introduced and XLM-R-COLI_{Mo Co} is designed. Parameter tuning experiments using grid search and cross-validation determine the hyperparameters of XLM-R-COLI_{Mi Co} and XLM-R-COLI_{Mo Co}. Comparative experiments show that XLM-R-COLI_{Mo Co} outperforms the other models on the Chinese offensive language identification task, verifying its effectiveness.

(3) A Chinese offensive language identification prototype system is designed and implemented with XLM-R-COLI_{Mo Co} as its core, and the usability of the model is verified.

The main innovations and contributions of this thesis are as follows.

(1) In the training dataset transfer process, a calculation method for the label distance for offensive language (LDOL) is proposed, together with a Chinese training data generation algorithm based on back-translation and label distance that determines the labels of the generated Chinese offensive language dataset.

(2) In the fine-tuning process of the Chinese offensive language identification model, supervised contrastive learning is introduced and then improved with momentum contrastive learning, which improves the model's detection performance on Chinese offensive language.
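The fine-tuning objective described in item (2) of the main work, a supervised contrastive loss combined with cross-entropy, plus the momentum update used in momentum contrastive learning, might look roughly like the sketch below. This is not the thesis's implementation: the abstract does not give the formulas, so the loss follows the commonly used supervised contrastive formulation, and the weighting `(1 - lam) * CE + lam * SCL`, the names `lam`, `temperature`, and `m`, and their default values are all assumptions made for illustration. Pure Python lists stand in for model tensors.

```python
import math

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of raw class scores."""
    total = 0.0
    for scores, y in zip(logits, labels):
        peak = max(scores)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[y]
    return total / len(labels)

def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Pull same-label embeddings together and push other labels apart."""
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-label partner is skipped
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

def combined_loss(logits, embeddings, labels, lam=0.5, temperature=0.1):
    """Assumed weighting: (1 - lam) * cross-entropy + lam * contrastive."""
    ce = cross_entropy(logits, labels)
    scl = supervised_contrastive_loss(embeddings, labels, temperature)
    return (1.0 - lam) * ce + lam * scl

def momentum_update(key_params, query_params, m=0.999):
    """Momentum-style update: the key encoder slowly tracks the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# Toy batch: two offensive (1) and two non-offensive (0) comments.
labels = [1, 1, 0, 0]
logits = [[0.2, 1.5], [0.1, 0.9], [1.2, 0.3], [0.8, 0.4]]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
loss = combined_loss(logits, embeddings, labels)
```

In an actual fine-tuning run the embeddings and logits would come from the XLM-R encoder and its classification head, and `lam` and `temperature` would be among the hyperparameters chosen by the grid search mentioned above.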

Keywords/Search Tags:

Offensive Language Identification, Transfer Learning, Pre-Trained Model, Supervised Contrastive Learning, Momentum Contrastive Learning

Related items

1	Research On Self-Supervised Contrastive Learning Video Representation Model
2	Research On Contrastive Learning Method Based On Nearest Neighbor Optimization And Momentum Updat
3	Research On Anomaly Detection Of Log Data Based On Contrastive Learning And Word Embedding
4	Algorithmic Studies On Knowledge Enhanced Pre-trained Language Models
5	Research On Knowledge Base Question Answering Model Based On Contrastive Learning
6	Augmentation Of Pre-Trained Model For Programming Language Based On Structure Information
7	Research On Text Clustering Based On Self-Supervised Contrastive Learning
8	Contrastive Learning Based Person Re-identification
9	Basic Algorithm And Framework Research On Contrastive Learning
10	Research On Weakly-supervised Learning Based On Sample Selection Strategy And Contrastive Learning