
Research On Chinese Text Classification Based On Deep Learning Theory

Posted on: 2020-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: W H Lai
Full Text: PDF
GTID: 2428330590984585
Subject: Control theory and control engineering
Abstract/Summary:
Text categorization is one of the key technologies in information mining and has been widely used in news categorization, sentiment analysis and public opinion supervision. Traditional text representation methods based on the Bag-of-words model and the vector space model still suffer from insufficient feature extraction ability and a large loss of feature information. Likewise, categorization algorithms based on traditional statistical learning and machine learning show limited categorization performance and generalization ability when faced with more complex text structure, multi-class categorization, data imbalance and similar problems. This thesis studies Chinese text categorization from two aspects: the text representation method and the deep learning model. It combines Chinese text representation methods with deep learning algorithms to improve performance on text categorization tasks. The research work of this thesis includes the following aspects:

1. Chinese text categorization based on a character-level Convolutional Neural Network. To address multi-class categorization of Chinese texts, a categorization method based on character-level text representation and a Convolutional Neural Network is proposed. First, a Chinese character dataset with a scale of 575,000 and its three corresponding Pinyin format datasets are constructed for this task. For the Chinese character dataset, a character dictionary is built from Chinese characters and punctuation marks; for the Pinyin format datasets, a character dictionary is built from Pinyin letters, numbers and punctuation marks. Then, based on the four character dictionaries, the corresponding character-level text representations are established as the input of the model. Finally, the model is trained and tested on the Chinese character dataset and its three corresponding Pinyin format datasets. The experimental results show that the model performs better on the Chinese character dataset than on the corresponding Pinyin format datasets. In addition, the model is compared with previous models on the same dataset. The results show that an appropriate character dictionary and appropriate Convolutional Neural Network hyperparameters play an important role in Chinese text categorization.

2. Chinese sentiment categorization based on an attention mechanism and a bidirectional Independent Recurrent Neural Network (IndRNN). Sentiment analysis, a subfield of text categorization, requires rich semantic features to be extracted. To address this, a categorization method based on word embedding, an attention mechanism and a bidirectional Independent Recurrent Neural Network is proposed. First, the original Chinese text is preprocessed by removing punctuation marks and special symbols. Then, word segmentation tools are used to segment the Chinese text, and the Skip-Gram model and the Chinese Wikipedia corpus are used to train word embeddings for the segmented text. Next, each word in the text is represented by its word embedding, and the resulting word embedding sequence is used as the input of the bidirectional Independent Recurrent Neural Network, which extracts the semantic features of the text. Finally, an attention mechanism is introduced to give higher weight to words that express emotion, so that the final feature vector representing the text contains both semantic information and the weight information of each keyword. On the same dataset, the proposed model is compared experimentally with LSTM, bidirectional LSTM, GRU and deep IndRNN. The results show that the proposed model achieves higher accuracy and F1 score than the other models on sentiment analysis tasks. This indicates that a model with multilayer stacked IndRNN and an attention mechanism can extract more comprehensive and richer semantic information and obtain better performance.

3. Finally, a hybrid text categorization system is designed by combining the character-level Convolutional Neural Network with the bidirectional Independent Recurrent Neural Network model with attention mechanism.
Keywords/Search Tags: text categorization, characters, Convolutional Neural Network, attention, Independent Recurrent Neural Network
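The character-level text representation described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `build_char_dict` and `encode`, the reserved `PAD`/`UNK` indices, the frequency-based ordering and the fixed-length padding are all assumptions made for illustration. The idea shown is only that each character (Chinese character or punctuation mark) is mapped to an integer index through a dictionary, and each text becomes a fixed-length index sequence that a model could take as input.

```python
from collections import Counter

# Reserved indices (an assumed convention, not taken from the thesis).
PAD, UNK = 0, 1


def build_char_dict(texts, max_size=None):
    """Map every character seen in `texts` to an integer index.

    Characters are ordered by descending frequency; indices start at 2
    because 0 and 1 are reserved for padding and unknown characters.
    """
    counts = Counter(ch for text in texts for ch in text)
    return {ch: i + 2 for i, (ch, _) in enumerate(counts.most_common(max_size))}


def encode(text, char_dict, max_len):
    """Turn `text` into a list of exactly `max_len` character indices."""
    ids = [char_dict.get(ch, UNK) for ch in text][:max_len]  # truncate long texts
    return ids + [PAD] * (max_len - len(ids))                # pad short texts


# Toy corpus (hypothetical, for illustration only).
corpus = ["今天天气很好。", "天气不好,不想出门。"]
char_dict = build_char_dict(corpus)
encoded = encode("天猫", char_dict, 4)  # "猫" is not in the dictionary
```

The same two functions would apply unchanged to a Pinyin format text, where the dictionary would instead be built from Pinyin letters, numbers and punctuation marks.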