
Research On Text Classification Of Chinese News Based On Deep Learning

Posted on: 2019-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: W Xu
GTID: 2428330569496085
Subject: Computer application technology
Abstract/Summary:
In the big data era of the 21st century, information is growing explosively. Text is an important carrier of information, and classifying it automatically so that large volumes of data can be stored, managed, and retrieved has become a subject worth studying. Early text categorization was based mainly on knowledge engineering: classification rules had to be defined and refined by hand, and a classifier was then constructed manually from those rules, which was time-consuming and laborious. With the rise of machine learning, machine learning classification techniques began to replace these early methods. However, traditional machine learning methods still require a great deal of time for feature engineering. Deep learning, a branch of machine learning, has attracted wide attention in recent years with the development of high-performance computing, and how to use it for natural language processing tasks, including automatic text categorization, has become a research hotspot. The main work of this paper is to apply deep learning models to Chinese text categorization.

First, for text representation: traditional text representation methods often ignore the positional relationships between words and the connections between contexts, and the resulting representation is sparse, so semantic information is lost. This paper therefore adopts word embedding, a distributed representation learned by a neural network and based on the distributional hypothesis. A neural network language model learns Chinese word vectors through unsupervised training with negative sampling on a large corpus of Chinese news text. The experimental results show that, with word embeddings as the text features, all of the models obtain good F1 values on the classification test.

Second, for classifier construction: traditional machine learning methods require time and effort to build features manually. To address this, this paper designs a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network, two common deep learning models that can extract features automatically, and applies them to a Chinese news text categorization experiment. Experimental results show that both models achieve better F1 values than the traditional K-nearest neighbor, Naive Bayes, and SVM classifiers.

Finally, to improve classification performance, the attention mechanism is introduced. It addresses a problem of the Encoder-Decoder model in natural language processing, in which the encoder and decoder are connected only by a fixed semantic encoding, which leads to the loss of some information. Based on the classical CNN and LSTM categorization models, this paper designs two models, CNN-attention and LSTM-attention. Experimental results show that, after introducing the attention mechanism, the F1 values of these two models improve to some extent over those of the classical CNN and LSTM models.
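
As a rough illustration of the attention idea in the abstract, the sketch below pools a sequence of hidden states into one vector using learned-style weights, rather than relying on a single fixed encoding. The abstract does not specify how the thesis computes attention scores, so the dot-product scoring, the function names `softmax` and `attention_pool`, the `query` vector, and the toy numbers are all assumptions made for this example and are not taken from the thesis.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Score each hidden state by its dot product with `query`,
    normalize the scores with softmax, and return the weights
    together with the weighted sum of the hidden states."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context


# Toy stand-ins for per-token LSTM (or CNN) outputs: 3 tokens, 2 dimensions.
# These numbers are made up for illustration.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]  # hypothetical context vector; learned in a real model

weights, context = attention_pool(hidden_states, query)
```

In a real classifier the `context` vector would be passed to an output layer that predicts the news category, and `query` would be trained together with the rest of the network. Here the third token receives the largest weight because it has the largest dot product with `query`.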
Keywords/Search Tags:Deep Learning, Text Categorization, Convolution Neural Network, Long Short-Term Memory Network, Attention Mechanism