Research Of Short-text Classification Method Based On Convolution Neural Network

Posted on: 2017-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: H P Cai
Full Text: PDF
GTID: 2348330503983846
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of computer technology, the mobile internet, and the internet industry, the number of internet users has grown explosively. Internet products have gradually matured, especially social products such as WeChat and Microblog. With a large number of active users, these platforms produce billions of short texts every day, including chat records and user comments. Correctly applying short-text classification technology to uncover the users' real intentions hidden behind these data has important research significance and great application value for government departments, research institutes, and internet service providers.

Since the concept of deep learning was proposed in 2006, it has achieved major breakthroughs in fields such as image and speech recognition. Many researchers have shown that models based on deep learning can outperform those based on traditional machine learning algorithms. This thesis explores a more suitable feature extraction method for short-text data and applies a convolutional neural network model, based on deep learning theory, to short-text classification. The main work of this thesis is as follows:

First, this thesis describes the short-text classification process in detail, including pre-processing, Chinese word segmentation, feature extraction, and the classification algorithm, among other steps. On this basis, it analyzes the particular characteristics of short-text data and discusses the problems with traditional text classification methods. This lays the foundation for the subsequent sections on feature extraction and the classification model.

Second, to describe the semantic relationships between words more fully in a continuous low-dimensional space, that is, to improve the expressive power of the features, Chinese Wikipedia is introduced as a training corpus in addition to the original data set. In the feature extraction stage, word embeddings are trained with the Skip-Gram neural network rather than designing features by hand in the traditional way. The word embeddings of each sample are then combined into a two-dimensional matrix that serves as its distributed representation.

Third, this thesis designs a convolutional neural network with convolution kernels of three different sizes, which automatically extracts a variety of local abstract features from the original input features. In addition, the original input features are treated as model parameters and updated at each iteration of training. Experimental results show that, compared with traditional machine learning algorithms such as Support Vector Machine, Random Forest, and Logistic Regression, the proposed sentiment classification model based on word embeddings and a CNN improves classification accuracy by 5.04%.

Finally, the thesis summarizes all of this work and outlines directions for future work.
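To make the second and third points more concrete, the following is a minimal, forward-pass-only sketch in plain Python of how a sentence's word embeddings might be stacked into a two-dimensional matrix and then scanned by convolution kernels of three different heights. It is not the thesis's implementation. The kernel heights (2, 3, 4), the embedding dimension, the ReLU activation, the max-over-time pooling, the random weights, and the function names are all assumptions made for illustration, since the abstract does not specify them. English tokens stand in for segmented Chinese words, and random vectors stand in for Skip-Gram embeddings.

```python
import random


def sentence_matrix(tokens, embeddings, dim):
    """Stack each token's embedding into a (len(tokens) x dim) matrix.

    Unknown tokens map to a zero vector (an assumption for this sketch).
    """
    zero = [0.0] * dim
    return [embeddings.get(tok, zero) for tok in tokens]


def conv_max_pool(matrix, kernel, bias=0.0):
    """Slide an (h x dim) kernel down the rows, apply ReLU, then max-pool.

    Returns a single feature value. If the sentence is shorter than the
    kernel there are no windows and the result is 0.0.
    """
    h = len(kernel)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(matrix) - h + 1):
        total = bias
        for i in range(h):
            for x, w in zip(matrix[start + i], kernel[i]):
                total += x * w
        best = max(best, total)
    return best


def extract_features(matrix, kernels_by_size):
    """Concatenate the pooled outputs of every kernel of every size."""
    return [
        conv_max_pool(matrix, kernel)
        for size in sorted(kernels_by_size)
        for kernel in kernels_by_size[size]
    ]


# --- Illustrative setup; every value below is hypothetical ---
rng = random.Random(0)
dim = 4                   # assumed embedding dimension
kernel_sizes = (2, 3, 4)  # assumed kernel heights (words per window)
filters_per_size = 2      # assumed number of kernels per size

tokens = ["this", "phone", "is", "really", "good"]
# Random vectors stand in for embeddings trained with Skip-Gram.
embeddings = {t: [rng.uniform(-1, 1) for _ in range(dim)] for t in tokens}

kernels = {
    h: [
        [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(h)]
        for _ in range(filters_per_size)
    ]
    for h in kernel_sizes
}

matrix = sentence_matrix(tokens, embeddings, dim)
features = extract_features(matrix, kernels)
```

In a full model, the resulting feature vector would feed a classifier layer, and training would update the kernels and, as the abstract describes, the input embeddings themselves. This sketch omits training entirely.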
Keywords/Search Tags: short-text classification, convolutional neural network (CNN), word embedding, distributed feature, natural language processing (NLP)