
Multi-target Sensitive Text Detection Based On Imbalanced Data

Posted on: 2020-12-24    Degree: Master    Type: Thesis
Country: China    Candidate: T Liang    Full Text: PDF
GTID: 2428330596475088    Subject: Computer Science and Technology
Abstract/Summary:
As artificial intelligence reaches deeper into human natural language, NLP (Natural Language Processing) technology plays an increasingly important role in everyday language processing tasks such as text classification, machine translation, part-of-speech tagging, and named entity recognition, and has achieved remarkable results. In the era of big data, however, the language datasets encountered in practice are messy, imbalanced, multi-label collections, unlike the clean, class-balanced, single-label datasets commonly used in academic research, and there is no uniform, efficient approach for handling them. The sensitive text detection studied in this thesis is precisely such a practical text classification task, involving both data imbalance and multi-label learning, so training a model with high accuracy and good robustness on multi-label imbalanced data is of great importance. The purpose of this thesis is to study the detection (classification) of sensitive text in datasets that exhibit both data imbalance and multi-label learning characteristics. The main research work consists of four parts.

(1) Text data is vectorized, converting character data into real-valued representations. The concept of the word vector is introduced. Unlike the existing word-vector training models Skip-Gram and CBOW, which do not take word order into account and therefore lose part of the semantic information of the original data, this thesis designs a word-vector training model, the Char-Word model, that incorporates word order information (an illustrative sketch of combining character- and word-level information follows this overview).

(2) The advantages and disadvantages of existing methods for handling data imbalance are analyzed, and their strengths are fused: this thesis combines data sampling, synonym substitution, data synthesis, and cost sensitivity into a single approach to the imbalance problem (a cost-sensitive loss sketch follows this overview). For the multi-label learning problem, the thesis first analyzes the advantages and disadvantages of existing binary relevance methods and algorithm adaptation methods, and then designs a method that considers not only the relationship between the text content and the individual labels but also the correlations among the labels themselves, so that training can exploit all of the relevant information in the data rather than only a part of it.

(3) To improve the learning ability of the text classification model, a network structure is designed that better extracts feature and semantic information from text. The thesis analyzes the residual network ResNet and the Inception-v3 network, both of which perform well in computer vision, combines the advantages of the two structures, and migrates them to text classification to form a new network structure, NRI (NLP ResNet Inception). Compared with a single CNN or RNN, NRI extracts the feature information of text data more effectively (a sketch of an Inception-style block with a residual connection follows this overview).
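The abstract does not describe the internal architecture of the Char-Word model, so the following is only a hypothetical PyTorch sketch of one common way to combine character-level and word-level information; the module and parameter names (`CharWordEmbedding`, `char_hidden`, the embedding sizes) are assumptions, not the author's design.

```python
import torch
import torch.nn as nn

class CharWordEmbedding(nn.Module):
    """Hypothetical fusion of word-level and character-level representations.

    Each token is represented by its word embedding concatenated with the
    final states of a character-level BiGRU, so the order of characters
    inside the word contributes to its vector.  This is NOT the thesis's
    Char-Word model, only an illustration of the character+word idea.
    """

    def __init__(self, vocab_size, char_vocab_size,
                 word_dim=200, char_dim=50, char_hidden=50):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        self.char_rnn = nn.GRU(char_dim, char_hidden,
                               batch_first=True, bidirectional=True)
        self.output_dim = word_dim + 2 * char_hidden

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        batch, seq_len, max_len = char_ids.shape
        w = self.word_emb(word_ids)                        # (B, S, word_dim)
        c = self.char_emb(char_ids.view(-1, max_len))      # (B*S, L, char_dim)
        _, h = self.char_rnn(c)                            # (2, B*S, char_hidden)
        c = h.transpose(0, 1).reshape(batch, seq_len, -1)  # (B, S, 2*char_hidden)
        return torch.cat([w, c], dim=-1)                   # (B, S, output_dim)
```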
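The thesis fuses sampling, synonym substitution, data synthesis, and cost sensitivity; as an illustration of the cost-sensitive component alone, the sketch below weights the per-label binary cross-entropy by inverse label frequency. The inverse-frequency weighting scheme is an assumption for illustration, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn

def label_pos_weights(label_matrix):
    """Per-label positive-class weights from an (N, L) 0/1 label matrix.

    Rare labels receive larger weights, a simple cost-sensitive heuristic;
    the exact scheme used in the thesis is not specified, so this is only
    an assumed inverse-frequency weighting.
    """
    pos = label_matrix.sum(dim=0).clamp(min=1.0)   # positives per label
    neg = label_matrix.shape[0] - pos              # negatives per label
    return neg / pos                               # > 1 for rare labels

# Usage: cost-sensitive multi-label binary cross-entropy on toy data.
labels = torch.randint(0, 2, (1000, 6)).float()    # 6 labels, as in Toxic Comment
criterion = nn.BCEWithLogitsLoss(pos_weight=label_pos_weights(labels))
logits = torch.randn(32, 6)                        # model outputs for one batch
loss = criterion(logits, labels[:32])
```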
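The exact NRI architecture is not given in this abstract; the sketch below only illustrates the general idea it describes, combining Inception-style parallel 1-D convolutions of several kernel sizes with a ResNet-style shortcut for text. The layer widths and kernel sizes are assumptions, not the NRI structure from the thesis.

```python
import torch
import torch.nn as nn

class InceptionResidualTextBlock(nn.Module):
    """Illustrative text block mixing Inception and ResNet ideas.

    Parallel 1-D convolutions with kernel sizes 1/3/5 capture n-gram
    features at several scales (Inception-style); their concatenation is
    projected back to the input width and added to the input through a
    shortcut (ResNet-style).
    """

    def __init__(self, channels=256, branch=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, branch, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5)
        ])
        self.project = nn.Conv1d(3 * branch, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, seq_len) -- token embeddings, channel-first
        multi_scale = torch.cat([self.act(b(x)) for b in self.branches], dim=1)
        return self.act(x + self.project(multi_scale))    # residual shortcut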
(4) Comparative experiments are carried out on the Toxic Comment dataset. First, on the original dataset, the proposed NRI network structure is compared with a CNN structure and a Bi-LSTM structure; the results show that NRI learns the features and semantic information of the text data better. Next, the proposed Char-Word method and the existing Skip-Gram and CBOW methods are each used to train word vectors for initializing the text classification model; judging by the classification performance, the Char-Word training method is more effective. Then, in the comparative experiments on data imbalance, the proposed method greatly improves the AUC of the model, demonstrating its effectiveness (an AUC evaluation sketch follows below); a further comparison on the multi-label learning problem likewise shows that the proposed method is effective. Finally, the proposed approach to multi-label imbalanced text classification is compared with existing mainstream methods: the LT method in this thesis achieves an accuracy of 0.914 on the validation set and 0.921 on the test set, with a model balance measure of 0.861, exceeding the performance of the other mainstream methods. In summary, the proposed method is shown to be effective for the multi-label imbalanced text classification problem.
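The experiments are reported in terms of AUC on the multi-label Toxic Comment data; below is a minimal sketch of computing per-label and mean ROC AUC with scikit-learn. The variable names and the random placeholder data are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: (N, L) 0/1 ground-truth label matrix; y_score: (N, L) predicted
# probabilities from the classifier -- random placeholders here.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(500, 6))
y_score = rng.random(size=(500, 6))

per_label_auc = [roc_auc_score(y_true[:, j], y_score[:, j])
                 for j in range(y_true.shape[1])]
print("per-label AUC:", np.round(per_label_auc, 3))
print("mean AUC:", round(float(np.mean(per_label_auc)), 3))
```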
Keywords/Search Tags: NLP, deep learning, text classification, imbalanced data, multi-label learning