
Research On Illegal Information Classification Technology Based On Encoder And Convolutional Neural Network

Posted on: 2020-04-28    Degree: Master    Type: Thesis
Country: China    Candidate: X P Ma    Full Text: PDF
GTID: 2416330623456779    Subject: Computer technology

Abstract/Summary:
With the rapid development of the Internet, more and more illegal information appears online, such as pornography, violence, and drug-related content, and detecting it has become a key research problem. Illegal information detection is a text classification task in natural language processing. Compared with traditional text classification, however, web search query text is noisy, short, and rich in new words, so constructing an efficient text representation model and text feature extraction model is the focus of this thesis.

Text classification involves many techniques, such as natural language processing and data mining, and its accuracy depends on many factors, including text preprocessing, text feature representation, feature selection, and classifier selection and optimization. Traditional text representation methods, such as the Boolean model and the vector space model, suffer from data sparseness and the curse of dimensionality. To further mine the hidden information in text, distributed vector representations based on neural networks, such as word2vec, have emerged. However, these distributed vectors capture only the semantic information of words and ignore their attribute information, and many feature extraction methods likewise ignore the structural information of the text. Based on this analysis, the thesis makes the following contributions:

(1) The text feature representation model is improved, and a multi-model fusion text representation model (LMCW) is proposed. The method first uses the word2vec tool to train distributed word vectors containing semantic information on the search query dataset, then weights these word vectors by the mutual information of each word. Next, it uses word2vec to train distributed word vectors on a Wikipedia dataset, which provide prior knowledge, and abstracts the lexical attributes of each word into an attribute vector. Finally, the three vectors are fused by concatenation into a single vector containing semantic information, lexical attribute information, and external knowledge. The validity of the model is verified by testing with an SVM classifier. A minimal sketch of such a fused representation is given below.

(2) Building on (1), two text feature extraction methods are proposed: a text feature extraction model based on LMCW and the Transformer (LMCWT), and a text feature extraction model based on LMCWT and a CNN (LMCWT-CNN). Starting from the LMCW representation, the LMCWT model extracts contextual information from the search query text through a Transformer encoder and trains feature vectors that combine semantic information, lexical attribute information, external knowledge, and context. Building on the features produced by LMCWT, the LMCWT-CNN model introduces a CNN to extract the structural information of the search query text and learn local features, so that text features are extracted at multiple levels and from multiple angles; a sketch of this encoder-plus-CNN architecture follows the first code example below. In tests on the search query dataset, the LMCWT-CNN model achieves better classification performance than the other models, and its accuracy further improves over the LMCW model.
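The following is a minimal sketch of how an LMCW-style fused word representation could be assembled, not the author's implementation. It assumes gensim for word2vec training; the mutual-information weights, the part-of-speech one-hot encoding standing in for the lexical attribute vector, the toy corpora, and the helper `fused_vector` are all illustrative assumptions.

```python
# Sketch of an LMCW-style fused word representation (illustrative only).
# Assumes gensim word2vec; the lexical-attribute encoding is a toy one-hot of
# part-of-speech tags standing in for the attribute vector described above.
import numpy as np
from gensim.models import Word2Vec

# Toy corpora: tokenized search queries and a stand-in for the Wikipedia corpus.
query_corpus = [["buy", "cheap", "tickets"], ["stream", "live", "sports"]]
wiki_corpus = [["tickets", "are", "sold", "online"],
               ["sports", "events", "stream", "live"]]

# 1) Domain word2vec on the search-query data (semantic information).
w2v_query = Word2Vec(query_corpus, vector_size=50, min_count=1, window=3)

# 2) External word2vec on the Wikipedia-style data (prior knowledge).
w2v_wiki = Word2Vec(wiki_corpus, vector_size=50, min_count=1, window=3)

# 3) Hypothetical mutual-information weights and lexical attributes per word.
mi_weight = {"buy": 0.8, "cheap": 0.6, "tickets": 0.9,
             "stream": 0.7, "live": 0.5, "sports": 0.9}
pos_tags = ["NOUN", "VERB", "ADJ", "OTHER"]
lex_attr = {"buy": "VERB", "cheap": "ADJ", "tickets": "NOUN",
            "stream": "VERB", "live": "ADJ", "sports": "NOUN"}

def one_hot(tag):
    vec = np.zeros(len(pos_tags))
    vec[pos_tags.index(tag)] = 1.0
    return vec

def fused_vector(word):
    """Concatenate MI-weighted domain vector, external vector, attribute vector."""
    semantic = mi_weight.get(word, 0.1) * w2v_query.wv[word]
    external = w2v_wiki.wv[word] if word in w2v_wiki.wv else np.zeros(50)
    attribute = one_hot(lex_attr.get(word, "OTHER"))
    return np.concatenate([semantic, external, attribute])  # 50 + 50 + 4 dims

print(fused_vector("tickets").shape)  # (104,)
```

Sequences of such fused vectors could then be fed to an SVM (after pooling) or to the downstream encoder, as in the thesis.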
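The second sketch illustrates the general shape of a Transformer-encoder-plus-CNN classifier in the spirit of LMCWT-CNN, written with standard PyTorch modules. The class name, layer sizes, number of heads, kernel widths, and the use of max-pooling over parallel 1-D convolutions are assumptions for illustration, not the architecture reported in the thesis.

```python
# Minimal PyTorch sketch of a Transformer-encoder + CNN text classifier
# (illustrative; layer sizes and hyperparameters are assumptions).
import torch
import torch.nn as nn

class TransformerCNNClassifier(nn.Module):
    def __init__(self, embed_dim=104, num_classes=2,
                 kernel_sizes=(2, 3, 4), num_filters=64):
        super().__init__()
        # Transformer encoder captures contextual information over query tokens.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Parallel 1-D convolutions capture local n-gram structure at several widths.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                 # x: (batch, seq_len, embed_dim) fused vectors
        h = self.encoder(x)               # contextualized token features
        h = h.transpose(1, 2)             # (batch, embed_dim, seq_len) for Conv1d
        pooled = [torch.relu(conv(h)).max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))

# Example forward pass: a batch of 8 queries, 12 tokens each, 104-d fused vectors.
model = TransformerCNNClassifier()
logits = model(torch.randn(8, 12, 104))
print(logits.shape)                       # torch.Size([8, 2])
```

The design mirrors the two-stage idea described above: the encoder supplies context-aware features, and the convolutions over its output add local structural information before classification.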
Keywords/Search Tags:Text Classification, Word2vec, Text Representation, Feature Extraction, CNN