
Classification Of Sexual Harassment Dialogue Texts Based On BERT-CNN

Posted on: 2022-12-20  Degree: Master  Type: Thesis
Country: China  Candidate: M R Yan  Full Text: PDF
GTID: 2518306770971739  Subject: Automation Technology
Abstract/Summary:
In recent years, the spread of the mobile Internet has expanded the user base of online social media platforms, and more and more people post and share their views on different incidents there. However, the content exchanged on these platforms is often poorly filtered, so users are likely to send or receive various kinds of sexually harassing messages. Such messages can have a serious negative impact on a person, leading to low self-esteem and depression, and even to self-destructive, suicidal and anti-social behaviour. This social phenomenon is already commonplace. However, the vast amount of data generated on these platforms every day makes it difficult for regulators or relevant practitioners to audit content case by case. Moreover, online discourse is dynamic, and it is difficult for reviewers to give a qualitative criterion that distinguishes sexually harassing texts from other texts. It would therefore be valuable to detect and classify such messages automatically. This dissertation proposes an automatic classification model for sexually harassing conversational texts. Its main contributions are as follows:
(1) In response to the lack of a Chinese-language dataset of sexual harassment conversations, this dissertation constructs one. After the conversations are cleaned, they are annotated by level. The annotation is based on the harassed person's tolerance of the discourse, and the dialogue text is divided into four levels of increasing severity. The dataset is also augmented with a translation-based approach so that it is more evenly distributed, which helps the model learn more semantic information during the training phase.
(2) To address the poor accuracy and frequent misclassification of keyword-matching methods, this dissertation uses the pre-trained model BERT to generate word vectors in place of a traditional word embedding model. The comprehensive semantic information of the sentence is represented by the [CLS] token, which is then classified by a linear classifier.
(3) To further exploit the rich semantic information in BERT, this thesis extends the output of BERT with a multi-layer convolutional neural network that extracts further features, uses max pooling for dimensionality reduction, then applies a ReLU function to reduce the computational effort, and finally classifies with a linear classifier.
(4) Extensive experiments demonstrate that the proposed model is effective in classifying online conversational texts. The experiments also show that the convolutional neural network-based model is more likely to classify correctly when the main meaning of a sentence is determined by a few local key pieces of text.
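The classification head described in contribution (3) can be illustrated with a minimal sketch. It is not the thesis implementation: the token vectors below are random stand-ins for BERT output, and the filter widths, hidden size, weights, and the function names `conv1d`, `relu` and `cnn_head` are all assumptions made for illustration. Only the order of operations (convolution, max pooling, ReLU, linear classifier) and the four output levels come from the abstract.

```python
import random

NUM_CLASSES = 4  # the abstract describes four harassment levels


def conv1d(tokens, kernel, bias):
    """Slide one filter along the token axis; return one value per window."""
    width = len(kernel)
    out = []
    for start in range(len(tokens) - width + 1):
        total = bias
        for k in range(width):
            for d, weight in enumerate(kernel[k]):
                total += weight * tokens[start + k][d]
        out.append(total)
    return out


def relu(x):
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0


def cnn_head(tokens, filters, linear_w, linear_b):
    """Convolution -> max pooling over time -> ReLU -> linear logits."""
    pooled = [relu(max(conv1d(tokens, kernel, bias))) for kernel, bias in filters]
    return [
        sum(w * p for w, p in zip(row, pooled)) + b
        for row, b in zip(linear_w, linear_b)
    ]


# Toy setup (all sizes and values are assumptions, not from the thesis).
rng = random.Random(0)
HIDDEN = 8    # stand-in for the BERT hidden size
SEQ_LEN = 6   # number of tokens in the toy sentence

# Random matrix standing in for per-token BERT output vectors.
tokens = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]

# One filter for each window width 2, 3 and 4.
filters = [
    ([[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(width)], 0.0)
    for width in (2, 3, 4)
]

# Untrained linear classifier mapping pooled features to the four levels.
linear_w = [[rng.uniform(-1, 1) for _ in filters] for _ in range(NUM_CLASSES)]
linear_b = [0.0] * NUM_CLASSES

logits = cnn_head(tokens, filters, linear_w, linear_b)
predicted_level = max(range(NUM_CLASSES), key=lambda i: logits[i])
```

Because max pooling keeps only the strongest window response for each filter, a single short span of tokens can dominate the pooled features. This is consistent with the observation in contribution (4) that the CNN-based model does well when a few local pieces of text determine the meaning of a sentence.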
Keywords/Search Tags: natural language processing, deep learning, feature extraction, text classification, crime prevention