Research On Filtering Method For Uncivilized Text Based On Deep Learning

Posted on: 2020-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Liu
Full Text: PDF
GTID: 2428330578952892
Subject: Computer software and theory
Abstract/Summary:
With the popularity of the Internet, social networking platforms have become deeply integrated into people's lives. People can freely express their views on microblogs, Baidu Tieba, news websites, and other social media platforms. Because the Internet is open to everybody, many uncivilized texts are posted online, which can have negative effects on society. To help build a harmonious online language environment, we studied the filtering of uncivilized text. Given the huge number and variety of online texts, we propose two deep-learning-based models, one for the classification and one for the recognition of uncivilized text. The experimental results indicate that our models perform better than existing text filtering methods. The main work of this thesis consists of the following three parts.

Firstly, we constructed an uncivilized text dataset. Currently, there is relatively little research on uncivilized text, and there is also a lack of a standard uncivilized text dataset for research. To address this lack of data, we first crawled text data from Sina Weibo, Baidu Tieba, Tencent News, and other social media platforms, and then constructed an uncivilized text dataset through manual annotation.

Secondly, we constructed a classification model to distinguish uncivilized texts from normal texts. Based on the characteristics of online uncivilized text, a convolutional neural network is adopted to classify uncivilized text. To address the relatively low precision of word segmentation for Chinese uncivilized texts, we propose a parallel convolutional neural network model (the CW-CNN model), in which Chinese characters and words are used in combination as the input. The CW-CNN model can to some extent solve the problems caused by inaccurate segmentation of Chinese uncivilized words. The experiments were conducted on the dataset we built. Compared with the existing CNN-based models, the CW-CNN model improves accuracy by 9.3%, recall rate by 9.9%, and F1 value by 9.2%.

Thirdly, we constructed a model to distinguish texts with a high degree of uncivilization from those with a low degree. The convolutional neural network model does perform well on uncivilized text categorization tasks. However, it can only extract local features from text and ignores long-distance semantic relationships within the text. To overcome this limitation, we propose a deep learning model (the BiLSTM-CNN model) for uncivilized text analysis, in which a convolutional neural network, a recurrent neural network, and an attention mechanism are combined to build the text classifier. Compared with the CW-CNN model, the BiLSTM-CNN model improves accuracy, recall rate, and F1 value by about 3.4% on the task of uncivilized text analysis.
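
The idea of feeding characters and words into parallel convolutional branches can be illustrated with a small pure-Python sketch. It is not the thesis's implementation: the abstract does not state embedding sizes, filter widths, or the segmenter used, so every name and parameter below (`cw_features`, `dim`, `n_filters`, `char_width`, `word_width`) is hypothetical, whitespace splitting stands in for a real Chinese word segmenter, the weights are random rather than trained, and the final classification layer is omitted.

```python
import random


def char_tokens(text):
    """Character-level view: every non-space character is a token."""
    return [ch for ch in text if not ch.isspace()]


def word_tokens(text):
    """Word-level view. Whitespace splitting is an assumption of this
    sketch; it stands in for a real Chinese word segmenter."""
    return text.split()


def embed(tokens, dim, seed):
    """Map each distinct token to a fixed pseudo-random vector
    (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    table = {}
    vectors = []
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1, 1) for _ in range(dim)]
        vectors.append(table[tok])
    return vectors


def make_filters(n_filters, width, dim, seed):
    """Random convolution filters, each spanning `width` tokens."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(width * dim)]
            for _ in range(n_filters)]


def conv_max_pool(vectors, filters, width, dim):
    """1-D convolution over token windows, ReLU, then max-over-time
    pooling. Returns one feature per filter."""
    # Pad short inputs with zero vectors so at least one window exists.
    padded = vectors + [[0.0] * dim] * max(0, width - len(vectors))
    features = []
    for filt in filters:
        best = 0.0  # starting at 0 applies the ReLU floor
        for start in range(len(padded) - width + 1):
            window = [x for vec in padded[start:start + width] for x in vec]
            activation = sum(w * x for w, x in zip(filt, window))
            best = max(best, activation)
        features.append(best)
    return features


def cw_features(text, dim=4, n_filters=3, char_width=3, word_width=2):
    """Run the character branch and the word branch in parallel and
    concatenate their pooled features."""
    char_branch = conv_max_pool(
        embed(char_tokens(text), dim, seed=1),
        make_filters(n_filters, char_width, dim, seed=2),
        char_width, dim)
    word_branch = conv_max_pool(
        embed(word_tokens(text), dim, seed=3),
        make_filters(n_filters, word_width, dim, seed=4),
        word_width, dim)
    return char_branch + word_branch


# Input is assumed to be pre-segmented, with spaces between words.
features = cw_features("这 条 评论 很 好")
print(len(features))  # one pooled feature per filter, from both branches
```

Because the character branch never depends on word boundaries, a segmentation error only affects half of the concatenated feature vector, which is the intuition behind combining the two views.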
Keywords/Search Tags: Uncivilized Text, Text Classification, Deep Learning, Convolutional Neural Network, Char Level Vector, LSTM, Attention