
The Abusive Language Detection Method Based On PU Learning And Transfer Learning

Posted on: 2022-09-15  Degree: Master  Type: Thesis
Country: China  Candidate: Y Zang  Full Text: PDF
GTID: 2518306551470944  Subject: Master of Engineering
Abstract/Summary:
The rise of social platforms such as WeChat, Twitter, and online games has promoted online interaction between users, but it has also given rise to the unrestrained use of abusive language online. Abusive language mainly refers to forms of expression that vilify or offend an individual or a group, and it harms both the public environment and the user experience. It is therefore important to build automatic abusive language detection systems for social platforms. In recent years, abusive language detection has attracted many researchers in the natural language processing community. Most existing methods used by social platforms to build such systems are supervised and require a large amount of labeled data. Although labeled data already exist for platforms such as Twitter and Facebook, the characteristics of abusive language may differ from platform to platform, so these data may not be usable on other platforms. Manual labeling, moreover, is time-consuming and labor-intensive. This thesis treats the abusive language collected through a social platform's complaint system as positive samples and proposes an abusive language detection algorithm based on Positive and Unlabeled (PU) learning. In addition, because few positive samples are available in the early stage of platform operation, the model can learn only limited information about abusive language. This thesis therefore further introduces a transfer learning method to improve detection. The specific work is as follows:

(1) Existing annotated data come mainly from platforms such as Twitter and Facebook, and the characteristics of abusive language may differ across platforms. It may therefore be unsuitable to use another platform's annotated data to build a detection system for a social platform that lacks annotated data, and manual annotation is time-consuming and labor-intensive. Considering that positive samples are easy to obtain from a platform's complaint system while negative samples are difficult to obtain, this thesis proposes an abusive language detection algorithm based on PU learning. The algorithm trains on positive and unlabeled samples and treats the unlabeled samples as negative samples with smaller weights, which reduces the cost of model training. To verify the method, experiments were conducted on the offensive language detection dataset (Offensive 2019). The results show that the proposed method achieves results close to those of supervised learning when only positive samples are labeled.

(2) Only a few positive samples are reported in the early stage of platform operation, which limits the information about abusive language that the model can learn. To address this problem, this thesis further proposes an abusive language detection algorithm based on transfer learning. The method takes abusive language detection over a dictionary of dirty words and the Oxford advanced dictionary as the source domain, and abusive language detection on the social platform as the target domain. Knowledge from the source domain is transferred to improve the performance of the target-domain model. To verify the effectiveness of transfer learning, experiments were conducted on top of both the PU learning method and the supervised learning method. The results show an overall improvement in F1 score.
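The PU weighting idea in (1), treating unlabeled samples as negatives with smaller weights, can be illustrated with a small sketch. The abstract does not state the model architecture, the features, or the weight value, so everything below is an assumption made for illustration: a plain bag-of-words logistic regression trained by stochastic gradient descent, a hypothetical `unlabeled_weight` of 0.3, and invented toy messages. It is not the thesis's implementation.

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words counts over lowercase whitespace tokens (illustrative choice)."""
    return Counter(text.lower().split())


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, bias, text):
    """Estimated probability that `text` is abusive."""
    feats = featurize(text)
    z = bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())
    return sigmoid(z)


def train_pu(positives, unlabeled, unlabeled_weight=0.3, epochs=200, lr=0.5):
    """Weighted logistic regression for PU data.

    Positives (e.g. complaint-system reports) get label 1 and weight 1.0.
    Unlabeled samples get label 0 but a smaller weight, because some of
    them may actually be abusive. The weight value is an assumption.
    """
    data = [(t, 1.0, 1.0) for t in positives]
    data += [(t, 0.0, unlabeled_weight) for t in unlabeled]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label, sample_weight in data:
            # Gradient of the weighted log loss with respect to the logit.
            err = sample_weight * (score(weights, bias, text) - label)
            bias -= lr * err
            for tok, cnt in featurize(text).items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * cnt
    return weights, bias


# Toy data (invented): the unlabeled pool hides one abusive message.
positives = ["you are an idiot", "shut up idiot", "you stupid loser"]
unlabeled = [
    "good game everyone",
    "see you tomorrow",
    "nice play well done",
    "you are an idiot loser",
    "thanks for the help",
]

weights, bias = train_pu(positives, unlabeled)
abusive_score = score(weights, bias, "stupid")
benign_score = score(weights, bias, "thanks for the help")
```

In this sketch, lowering `unlabeled_weight` reduces the penalty for scoring an unlabeled message as abusive, which matters when the unlabeled pool contains unreported abusive messages. How the thesis chooses or tunes the weight is not stated in the abstract.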
Keywords/Search Tags: social platform, abusive language detection, complaint system, PU learning, transfer learning