
Research On BERT-based Uncivilized Language Detection Method

Posted on: 2024-02-16    Degree: Master    Type: Thesis
Country: China    Candidate: Q C Tao    Full Text: PDF
GTID: 2568307178474164    Subject: Software engineering
Abstract/Summary:
With the rapid popularization of the mobile internet, social media has become the main channel through which internet users access information. Online violence and uncivilized speech on social media are becoming increasingly apparent and have serious negative impacts on individuals and society. Researchers in natural language processing are attempting to use deep learning methods to detect uncivilized language automatically, which greatly improves detection efficiency. However, current work has several shortcomings: the lack of a Chinese dataset of uncivilized text, the lack of comparative analysis of classification methods for uncivilized Chinese text, and the absence of methods capable of recognizing uncivilized language at a fine granularity. This paper presents a systematic study of uncivil language detection methods based on the BERT model. The main research work includes the following aspects.

First, a Chinese uncivil text dataset is constructed. The openly accessible uncivil text datasets on the internet currently target English, which greatly limits research on Chinese uncivil language detection. Therefore, this paper constructs a Chinese uncivil text dataset that includes a binary classification dataset, a multi-label classification dataset, and an uncivil block recognition dataset. The binary classification dataset simply divides texts into civilized and uncivilized. In the multi-label classification dataset, each text corresponds to one or more labels. The classification criteria follow the Toxic Comment Classification Challenge on Kaggle and include four types: "toxic", "obscene", "insult", and "threat". Unlike the text classification datasets, the uncivil block dataset annotates each character, allowing precise identification of uncivil blocks in the text.

Second, a comparative analysis of the performance of different models on Chinese uncivil text classification tasks is conducted. Most existing methods target English uncivil text classification, and their effectiveness on Chinese text data needs further research. Therefore, this paper systematically compares and analyzes Chinese uncivil text classification methods based on different models. The pre-trained models compared include BERT and ALBERT, and they are also compared with traditional machine learning methods. The experimental results on the binary classification dataset show that the BERT-based methods achieve the best classification accuracy.

Finally, a method for uncivil block recognition based on the BERT model is proposed. Existing uncivil language detection methods usually recognize and block entire texts, which lacks flexibility in practical applications. Therefore, this paper proposes a more fine-grained uncivil language detection method. It identifies and blocks the uncivil blocks within a text rather than blocking the entire text, which greatly improves its practicality. The basic idea is to convert the text block recognition task into a named entity recognition task. This paper compares the proposed method with several benchmark methods. The experimental results on the uncivil block dataset show that the BERT-based method outperforms existing methods, achieving the best recognition results.
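
To make the idea of treating block recognition as named entity recognition more concrete, here is a minimal sketch of the post-processing step only. It assumes a BIO tagging scheme with the hypothetical tag names `B-UNCIVIL`, `I-UNCIVIL`, and `O`, one tag per character. The abstract does not specify the tagging scheme, and the tags here are written by hand rather than predicted by a BERT model. The function names are illustrative and not taken from the thesis.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert per-character BIO tags into half-open (start, end) spans.

    The tag names are assumptions for illustration, not the thesis's scheme.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # start index of the span currently being built, if any
    for i, tag in enumerate(tags):
        if tag == "B-UNCIVIL":
            if start is not None:  # a new block begins; close the previous one
                spans.append((start, i))
            start = i
        elif tag == "I-UNCIVIL":
            if start is None:  # tolerate an I tag that has no preceding B tag
                start = i
        else:  # "O" or any other tag ends the current block
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:  # block runs to the end of the text
        spans.append((start, len(tags)))
    return spans


def mask_spans(text: str, spans: List[Tuple[int, int]], mask_char: str = "*") -> str:
    """Replace every character inside the given spans with mask_char."""
    chars = list(text)
    for start, end in spans:
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


# Hand-written example: one tag per character of the text.
text = "hi fool"
tags = ["O", "O", "O", "B-UNCIVIL", "I-UNCIVIL", "I-UNCIVIL", "I-UNCIVIL"]
spans = bio_to_spans(tags)
print(spans)                    # [(3, 7)]
print(mask_spans(text, spans))  # hi ****
```

Masking only the tagged span leaves the rest of the text readable, which is the flexibility that blocking an entire text lacks.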
Keywords/Search Tags: Uncivilized Text, Text Classification, Named Entity Recognition, BERT Model