
Research On Text Outlier Detection Based On Convolution Neural Network

Posted on: 2020-06-06, Degree: Master, Type: Thesis
Country: China, Candidate: C Ma
GTID: 2428330575977328, Subject: Computer technology
Abstract/Summary:
In recent years, with the rapid development of machine learning and deep learning, the demand for high-quality training data has grown. At present, high-quality data are obtained mainly by searching existing literature or open-source data sets. In specialized fields and for specific problems, however, it is often difficult to find suitable high-quality data directly, so data must be collected manually and then processed further. Two situations arise at this point. In the first, the collected data contain a small amount of noise. In the second, only a small part of the collected data is needed and the rest, the large majority, is noise. In both cases an effective method is required to select the wanted data and discard the noise.

Researchers in academia and industry have proposed a variety of outlier detection methods for this problem, including frequency-based, statistics-based, depth-based or distance-based, and machine-learning-based methods. These methods work well on structured data, but many of them have little effect on unstructured data, especially text. This thesis introduces convolutional neural networks into text outlier detection and improves on them. The specific work is as follows:

1. The characteristics of text outliers are analyzed, and a text outlier detection method based on a convolutional neural network is proposed. A recurrent neural network consumes its input one time step at a time, whereas the pooling operation of a convolutional neural network discards part of the position information, which suits detection on text whose word order is unreliable. In addition, the convolution operation closely mimics an n-gram language model.

2. A complete pipeline is proposed, running from the target data to the construction of a control set, word vector pre-training and incremental ("gain") training, and model training and iteration. The Xenc tool is used to compute the cross entropy of the out-of-domain and in-domain data, and the ranked data are then selected according to a fixed proportion. The word vector model is first pre-trained on a large corpus and then trained further on the domain corpus; this incremental training balances the information carried by a word itself against the context in which the word is used. Training is iterated so that the results approach the expected effect step by step.

3. For outlier detection on short text, a convolutional neural network model with morphological (part-of-speech) features is proposed, and experiments are designed to validate it. Many colloquial sentences are syntactically wrong, yet most of their lexical information is preserved. Introducing part-of-speech information extends the information dimension, especially for directive statements.

4. For text in small data sets, a data expansion method is proposed: the words are first given a position encoding and are then rearranged to generate additional samples. This effectively enlarges the data set while keeping as much of the original text information as possible. Its validity is verified by experiments.
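The remark in point 1 that pooling discards position information can be illustrated with a toy example. The sketch below is not the thesis model: the two-dimensional embeddings, the single width-2 filter, and all function names are made up for illustration. It slides one filter over a sentence, which acts like a bigram detector, and then takes the maximum over all positions, so the pooled feature is the same wherever the matching bigram occurs.

```python
# Toy illustration only: embeddings, filter weights and names are hypothetical.
EMB = {
    "buy":    [1.0, 0.0],
    "ticket": [0.0, 1.0],
    "please": [0.2, 0.2],
    "now":    [0.1, 0.3],
}

# One convolution filter of width 2: one weight row per slot of the window.
FILTER = [[1.0, 0.0],   # responds to "buy" in the first slot
          [0.0, 1.0]]   # responds to "ticket" in the second slot


def conv1d(tokens, filt):
    """Slide the filter over every window of len(filt) consecutive tokens."""
    width = len(filt)
    scores = []
    for start in range(len(tokens) - width + 1):
        score = 0.0
        for slot in range(width):
            vec = EMB[tokens[start + slot]]
            score += sum(w * x for w, x in zip(filt[slot], vec))
        scores.append(score)
    return scores


def max_over_time(scores):
    """Max pooling keeps the strongest response and drops where it occurred."""
    return max(scores)


# The same bigram at the start of one sentence and at the end of the other.
a = ["buy", "ticket", "please", "now"]
b = ["please", "now", "buy", "ticket"]
feat_a = max_over_time(conv1d(a, FILTER))
feat_b = max_over_time(conv1d(b, FILTER))
```

The per-position scores of `a` and `b` peak at different indices, but `feat_a` and `feat_b` are equal, which is the loss of position information the abstract refers to.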
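The data selection step in point 2 can be sketched as follows. This is a minimal stand-in, not the Xenc tool and not the thesis pipeline: it assumes add-one-smoothed unigram language models, and it assumes that candidates are ranked by the difference between their in-domain and out-of-domain cross entropy, which the abstract does not spell out. The function names, the example sentences, and the `keep_ratio` parameter are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count tokens; returns (counts, total) for an add-one-smoothed unigram model."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross entropy (in bits) of a sentence under a unigram model."""
    counts, total = model
    tokens = sentence.split()
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select(candidates, in_domain, out_domain, keep_ratio):
    """Rank by H_in - H_out (lower means more in-domain) and keep the top share."""
    vocab = {t for s in in_domain + out_domain + candidates for t in s.split()}
    m_in, m_out = train_unigram(in_domain), train_unigram(out_domain)
    ranked = sorted(
        candidates,
        key=lambda s: cross_entropy(s, m_in, len(vocab))
        - cross_entropy(s, m_out, len(vocab)),
    )
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]


# Hypothetical data: a voice-command domain against general text.
in_domain = ["play the next song", "turn up the volume", "pause the music"]
out_domain = ["the weather is nice today", "stock prices fell again",
              "the match ended in a draw"]
candidates = ["play some music", "prices fell today",
              "turn the volume down", "the match is today"]
selected = select(candidates, in_domain, out_domain, keep_ratio=0.5)
```

With a real corpus the unigram models would be replaced by stronger language models, and the proportion kept would be tuned on held-out data.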
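The data expansion idea in point 4 can be sketched in the same spirit. The abstract does not give the exact encoding, so the sketch assumes the simplest reading: each word is paired with its original index before shuffling, so every generated sample still carries the original order. The function names and the `seed` parameter are hypothetical.

```python
import random


def augment(sentence, n_copies, seed=0):
    """Return the original plus shuffled copies; each token keeps its original index."""
    tagged = list(enumerate(sentence.split()))   # [(position, word), ...]
    rng = random.Random(seed)
    samples = [tagged]
    for _ in range(n_copies):
        shuffled = tagged[:]          # copy so the original order is kept intact
        rng.shuffle(shuffled)
        samples.append(shuffled)
    return samples


def restore(sample):
    """Recover the original sentence from any sample via the stored positions."""
    return " ".join(word for _, word in sorted(sample))


samples = augment("turn the volume down", n_copies=3)
```

In a model, the stored index would feed a position embedding alongside the word embedding, so the network sees new word arrangements without losing the original order information.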
Keywords/Search Tags:convolutional neural network, outlier detection, data selection, part of speech feature, position embedding