
Research On Network Uncivilized Text Classification Methods Based On Semi-supervised Learning Models

Posted on: 2022-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Yang
GTID: 2518306347489584
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the volume of network data has grown explosively, and people are surrounded by all kinds of online information. A large amount of uncivilized language is mixed into this information. It not only interferes with information transmission and knowledge sharing but also hinders the healthy development of online culture. Furthermore, it seriously pollutes the cyberspace environment and endangers the construction of national spiritual civilization. To provide users with a friendly cyberspace environment and to address problems such as the lack of contextual information and of dedicated data sets, we use semi-supervised learning models to mine uncivilized information in network texts. The main research work of this thesis is as follows.

First, we construct a vocabulary mining data set and a text classification data set. Because current research lacks a dedicated data set of network uncivilized language, we use web crawler technology to collect netizens' comments on hot events from Weibo, Tieba and YouTube in turn, and then build such a data set through data cleaning and manual annotation.

Second, to address the semantic ambiguity of network texts, we first use a hybrid deep learning method to construct an Internet uncivilized dictionary, and then propose a semi-supervised learning model (SSVAE-WD) to classify Internet uncivilized texts. We introduce a custom network dictionary when training the word vectors to obtain the word vector representation of the text. We then exploit the characteristics of the VAE to optimize the model: adding Gaussian noise to the mean network makes the decoder more robust to noise. As a result, the input can be reconstructed accurately and the loss of text information is reduced, so that the features of uncivilized short texts on the Internet are expressed effectively. Finally, we test the classification performance of the model on different labeled data sets of network uncivilized text, and the experiments verify the effectiveness of the proposed method.

Third, to reduce the model's over-reliance on labeled data, we propose the UDA-SR model to classify uncivilized texts. We use four data augmentation methods (Random Deletion, Synonym Replacement, Random Swap and Random Insertion) to augment the unlabeled uncivilized network text. Following the idea of consistency regularization, we compute a loss between the unlabeled data and its augmented version, and we compute the loss between the pseudo-label and the real label with cross-entropy. The experimental results show that, compared with the supervised learning model BERT, the proposed UDA-SR model achieves better classification results with fewer labeled data. Increasing the amount of unlabeled data can further improve the classification performance of the model.
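
The four augmentation operations named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy `SYNONYMS` table, and the parameters `p` and `n` are assumptions made for illustration, and a real system would draw synonyms from a thesaurus or from embedding neighbours.

```python
import random

# Hypothetical toy synonym table (an assumption for this sketch only).
SYNONYMS = {"bad": ["poor", "awful"], "stupid": ["foolish", "dumb"]}


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def synonym_replacement(tokens, n, rng):
    """Replace up to n tokens that have an entry in SYNONYMS."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(tokens, n, rng):
    """Insert a synonym of a randomly chosen known token at a random position, n times."""
    out = list(tokens)
    for _ in range(n):
        known = [t for t in out if t in SYNONYMS]
        if not known:
            break
        word = rng.choice(SYNONYMS[rng.choice(known)])
        out.insert(rng.randint(0, len(out)), word)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = ["you", "are", "bad", "and", "stupid"]
    for augment in (random_swap, synonym_replacement, random_insertion):
        print(augment.__name__, augment(sentence, 1, rng))
    print("random_deletion", random_deletion(sentence, 0.2, rng))
```

Passing an explicit `random.Random` instance keeps the augmentations reproducible, which helps when comparing runs with different amounts of unlabeled data.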
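
The training objective described for UDA-SR combines a supervised cross-entropy term with a consistency term on unlabeled data. The sketch below shows one common way to write such an objective, assuming the consistency term is a KL divergence between the predictions on an unlabeled text and on its augmented copy, and assuming a hypothetical weighting factor `weight`. The abstract does not give the exact formulation, so this should be read as an illustration of consistency regularization in general rather than as the thesis's loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between the predicted distribution and an integer class label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def semi_supervised_loss(labeled, unlabeled_pairs, weight=1.0):
    """Supervised cross-entropy plus a weighted consistency term.

    labeled: list of (logits, label) pairs for labeled texts.
    unlabeled_pairs: list of (logits_original, logits_augmented) pairs.
    weight: assumed trade-off factor, not taken from the source.
    """
    supervised = sum(cross_entropy(lg, y) for lg, y in labeled) / len(labeled)
    consistency = sum(
        kl_divergence(softmax(orig), softmax(aug)) for orig, aug in unlabeled_pairs
    ) / len(unlabeled_pairs)
    return supervised + weight * consistency


if __name__ == "__main__":
    labeled = [([2.0, 0.5], 0), ([0.1, 1.5], 1)]
    unlabeled_pairs = [([1.0, 2.0], [0.8, 2.1]), ([3.0, 0.0], [1.0, 1.0])]
    print(semi_supervised_loss(labeled, unlabeled_pairs, weight=1.0))
```

The consistency term is zero when the model predicts the same distribution for a text and its augmented copy, so it only penalizes predictions that change under augmentation; this is what lets unlabeled data contribute to training.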
Keywords/Search Tags:Network Uncivilized Language Text, Semi-Supervised Learning, Variational AutoEncoders, Unsupervised Data Augmentation