
Research On Illegal Information Classification Technology Based On Encoder And Convolutional Neural Network

Posted on: 2020-04-28    Degree: Master    Type: Thesis
Country: China    Candidate: X P Ma    Full Text: PDF
GTID: 2416330623456779    Subject: Computer technology

Abstract/Summary:
With the rapid development of the Internet, more and more illegal information appears online, such as pornography, violence, and drug-related content, and detecting it has become a key research problem. Illegal information detection is a text classification task in natural language processing. Compared with traditional text classification, however, web search query text is noisy, short, and rich in new words, so constructing an efficient text representation model and text feature extraction model is the focus of this thesis.

Text classification involves many techniques, such as natural language processing and data mining, and its accuracy depends on many factors, including text preprocessing, text feature representation, feature selection, and classifier selection and optimization. Traditional text representation methods, such as the Boolean model and the vector space model, suffer from data sparseness and the curse of dimensionality. To further mine the hidden information in text, distributed vector representations based on neural networks, such as word2vec, have emerged. However, these distributed vectors capture only the semantic information of words and ignore their attribute information, and many feature extraction methods likewise ignore the structural information of the text. Based on this analysis, the thesis makes the following contributions:

(1) The text feature representation model is improved, and a multi-model fusion text representation model (LMCW) is proposed. The method first uses the word2vec tool to train distributed word vectors containing semantic information on the search query dataset, then weights these word vectors by the mutual information of each word. Next, it uses word2vec to train distributed word vectors on a Wikipedia dataset, which provide prior knowledge, and abstracts the lexical attributes of each word into an attribute vector. Finally, the three vectors are fused by concatenation into a single vector containing semantic information, lexical attribute information, and external knowledge. The validity of the model is verified by testing with an SVM classifier. A minimal sketch of such a fused representation is given below.

(2) Building on (1), two text feature extraction methods are proposed: a text feature extraction model based on LMCW and the Transformer (LMCWT), and a text feature extraction model based on LMCWT and a CNN (LMCWT-CNN). Starting from the LMCW representation, the LMCWT model extracts contextual information from the search query text through a Transformer encoder and trains feature vectors that combine semantic information, lexical attribute information, external knowledge, and context. Building on the features produced by LMCWT, the LMCWT-CNN model introduces a CNN to extract the structural information of the search query text and learn local features, so that text features are extracted at multiple levels and from multiple angles; a sketch of this encoder-plus-CNN architecture follows the first code example below. In tests on the search query dataset, the LMCWT-CNN model achieves better classification performance than the other models, and its accuracy further improves over the LMCW model.
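The following is a minimal sketch of how an LMCW-style fused word representation could be assembled, not the author's implementation. It assumes gensim for word2vec training; the mutual-information weights, the part-of-speech one-hot encoding standing in for the lexical attribute vector, the toy corpora, and the helper `fused_vector` are all illustrative assumptions.

```python
# Sketch of an LMCW-style fused word representation (illustrative only).
# Assumes gensim word2vec; the lexical-attribute encoding is a toy one-hot of
# part-of-speech tags standing in for the attribute vector described above.
import numpy as np
from gensim.models import Word2Vec

# Toy corpora: tokenized search queries and a stand-in for the Wikipedia corpus.
query_corpus = [["buy", "cheap", "tickets"], ["stream", "live", "sports"]]
wiki_corpus = [["tickets", "are", "sold", "online"],
               ["sports", "events", "stream", "live"]]

# 1) Domain word2vec on the search-query data (semantic information).
w2v_query = Word2Vec(query_corpus, vector_size=50, min_count=1, window=3)

# 2) External word2vec on the Wikipedia-style data (prior knowledge).
w2v_wiki = Word2Vec(wiki_corpus, vector_size=50, min_count=1, window=3)

# 3) Hypothetical mutual-information weights and lexical attributes per word.
mi_weight = {"buy": 0.8, "cheap": 0.6, "tickets": 0.9,
             "stream": 0.7, "live": 0.5, "sports": 0.9}
pos_tags = ["NOUN", "VERB", "ADJ", "OTHER"]
lex_attr = {"buy": "VERB", "cheap": "ADJ", "tickets": "NOUN",
            "stream": "VERB", "live": "ADJ", "sports": "NOUN"}

def one_hot(tag):
    vec = np.zeros(len(pos_tags))
    vec[pos_tags.index(tag)] = 1.0
    return vec

def fused_vector(word):
    """Concatenate MI-weighted domain vector, external vector, attribute vector."""
    semantic = mi_weight.get(word, 0.1) * w2v_query.wv[word]
    external = w2v_wiki.wv[word] if word in w2v_wiki.wv else np.zeros(50)
    attribute = one_hot(lex_attr.get(word, "OTHER"))
    return np.concatenate([semantic, external, attribute])  # 50 + 50 + 4 dims

print(fused_vector("tickets").shape)  # (104,)
```

Sequences of such fused vectors could then be fed to an SVM (after pooling) or to the downstream encoder, as in the thesis.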
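The second sketch illustrates the general shape of a Transformer-encoder-plus-CNN classifier in the spirit of LMCWT-CNN, written with standard PyTorch modules. The class name, layer sizes, number of heads, kernel widths, and the use of max-pooling over parallel 1-D convolutions are assumptions for illustration, not the architecture reported in the thesis.

```python
# Minimal PyTorch sketch of a Transformer-encoder + CNN text classifier
# (illustrative; layer sizes and hyperparameters are assumptions).
import torch
import torch.nn as nn

class TransformerCNNClassifier(nn.Module):
    def __init__(self, embed_dim=104, num_classes=2,
                 kernel_sizes=(2, 3, 4), num_filters=64):
        super().__init__()
        # Transformer encoder captures contextual information over query tokens.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Parallel 1-D convolutions capture local n-gram structure at several widths.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                 # x: (batch, seq_len, embed_dim) fused vectors
        h = self.encoder(x)               # contextualized token features
        h = h.transpose(1, 2)             # (batch, embed_dim, seq_len) for Conv1d
        pooled = [torch.relu(conv(h)).max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))

# Example forward pass: a batch of 8 queries, 12 tokens each, 104-d fused vectors.
model = TransformerCNNClassifier()
logits = model(torch.randn(8, 12, 104))
print(logits.shape)                       # torch.Size([8, 2])
```

The design mirrors the two-stage idea described above: the encoder supplies context-aware features, and the convolutions over its output add local structural information before classification.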
Keywords/Search Tags:Text Classification, Word2vec, Text Representation, Feature Extraction, CNN