Research On Chinese Text Classification Based On Character-level Convolutional Neural Networks

Posted on: 2019-09-16    Degree: Master    Type: Thesis
Country: China    Candidate: K Liu    Full Text: PDF
GTID: 2428330545954463    Subject: Computer application technology
Abstract/Summary:
With the rapid development of the mobile Internet in China, the explosive growth in mobile Internet users has made it possible for everyone to become a small self-media outlet. Content is disseminated mainly as text, giving rise to a new content-based information era. Hundreds of millions of items are generated on various platforms every day, such as news, self-media articles, and product reviews. How to use this content and tap its potential value is an important task for natural language processing. Text classification, one of the primary tasks, assigns these texts to pre-specified categories so that the burden on staff is reduced.

However, today's text content often contains non-standard usage and typos, which degrades the classification performance of models that use words as text features. Word-based models are also affected by the word segmentation step: the quality of the segmentation determines, to some extent, the quality of the final classification. In addition, as the number of mobile devices grows, their limitations, such as memory, make deploying practical applications on them a problem that needs to be solved. Based on these considerations, this thesis studies the character-level convolutional neural network model. The work is summarized as follows.

Research on encoding methods for Chinese character representation. The encoding methods considered include pinyin encoding, UTF-8 encoding, picture encoding, random character embedding vectors, and pre-trained character embedding vectors. The thesis compares these five character representations comprehensively and analyzes the characteristics of each method.

Research on pre-trained embedding vectors for Chinese characters. Two novel pre-training methods for character embedding vectors are proposed, both of which use unsupervised learning similar to Skip-Gram to learn the embedding vectors. The learned character embedding vectors contain some knowledge of syntactic and semantic structure, which in turn helps optimize the final character-level convolutional neural network. The learned Chinese character embedding vectors are incorporated into the training of the convolutional neural network, and the resulting model obtains the best classification performance. Compared with convolutional neural network models that use other Chinese character representation methods, the proposed method runs faster, does not require word segmentation when applied, and handles the Out of Vocabulary (OOV) problem better.
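
As a rough illustration of two of the ideas named in the abstract, the sketch below shows how a Chinese string might be turned into UTF-8 byte indices for a character-level model, and how Skip-Gram-style (center, context) pairs could be drawn at the character level without word segmentation. It is not the thesis's implementation: the function names `utf8_byte_ids` and `skipgram_pairs`, the padding index 256, and the `max_len` and `window` parameters are all assumptions made for this example.

```python
def utf8_byte_ids(text, max_len=16, pad_id=256):
    """Map text to a fixed-length list of UTF-8 byte values (0-255).

    Assumption for this sketch: index 256 is reserved for padding.
    Truncation happens at the byte level, so a multi-byte character
    at the boundary may be cut; a real pipeline would need to decide
    how to handle that.
    """
    ids = list(text.encode("utf-8"))[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


def skipgram_pairs(text, window=2):
    """Generate (center, context) character pairs within a window.

    Each character is treated as a token, so no word segmentation
    step is needed before building training pairs.
    """
    chars = list(text)
    pairs = []
    for i, center in enumerate(chars):
        lo = max(0, i - window)
        hi = min(len(chars), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, chars[j]))
    return pairs


if __name__ == "__main__":
    print(utf8_byte_ids("中", max_len=4))
    print(skipgram_pairs("文本分类", window=1))
```

Because every Chinese character has a UTF-8 byte sequence, a byte-level input never meets an unseen symbol, which is one way to read the abstract's remark about the OOV problem. The pairs produced by `skipgram_pairs` would then feed whatever embedding objective is chosen; that training step is left out here.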
Keywords/Search Tags:Text classification, Convolutional neural networks, Character embedding, Chinese character coding