
Research On Chinese Text Classification Based On Deep Learning

Posted on: 2021-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: G H Wang
GTID: 2428330602995157
Subject: Engineering
Abstract/Summary:
As natural language processing moves into practical application, text classification, one of its basic technologies, has been widely studied. Improving classification accuracy is an important step toward putting the technology into practice. Before a computer can classify text, it must be able to read it: text representation transforms text into data that a computer can process. Information is lost in this conversion, however, and the loss leads to classification errors, so reducing information loss in the text representation is very important. An analysis of current research on text representation shows that domestic research is strongly influenced by foreign research. Foreign research on English text takes words as the semantic unit, so domestic research on Chinese is also word-based. But Chinese differs from English: after thousands of years of development, individual Chinese characters themselves carry information.

Traditional text classification is mostly based on shallow machine learning. With the rise of deep learning, neural networks have gradually been applied to natural language processing, with the convolutional neural network (CNN) and the long short-term memory (LSTM) network being the most representative. Both have shortcomings in text classification. Order is very important in text, but the text features extracted by a CNN are unordered. Highlighting the main information of a text helps to improve classification accuracy, but the text features extracted by an LSTM network cannot highlight that information.

This paper improves the text representation to reduce information loss and thereby improve the accuracy of text classification. The improved representation uses both words and characters as semantic units, and uses word2vec with the skip-gram model to pre-train word vectors in word space and character vectors in character space. The new text representation is obtained by combining the word and character vectors, and is then classified with machine learning algorithms. Building on this, the paper explores whether the improved text representation is applicable to deep learning, with the aim of further improving classification accuracy over the machine learning algorithms. Neural networks applied to text data usually have a single embedding layer that takes one text representation as input. To use the improved representation in a neural network, this paper designs a dual embedding layer: one embedding layer takes the word embedding data as input, and the other takes the character embedding data.

The two networks have complementary strengths. The features extracted by a CNN are unordered, while those extracted by an LSTM network are ordered; conversely, an LSTM network cannot highlight the main features of a text, while a CNN can. Based on these advantages and disadvantages, this paper designs three new hybrid network structures, namely the C-LSTM network, the LSTM-CNN network and the CNN-LSTM-Parallel network, and sets a third goal: to improve the accuracy of text classification by more than one percentage point over the plain CNN and LSTM networks.

According to the experimental results, for both the machine learning and the deep learning algorithms, and measured by the Macro-F1 value, the improved text representation improves classification accuracy by about 1% compared with the word embedding representation (by about 3% on the Sogou Laboratory data), and generally by more than 3% compared with the character embedding representation. Comparing results under the same text representation, the deep learning algorithms are generally 1% to 2% more accurate than the machine learning algorithms. Among the improved networks, the LSTM-CNN network achieves the highest classification accuracy on every text representation: about 3% higher than the machine learning algorithms, and 1% to 2% higher than the CNN and the LSTM network. This shows that the network combines the advantages of the CNN and the LSTM network.
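
The idea of combining word and character vectors into one text representation can be illustrated with a minimal Python sketch. The abstract does not say how the two kinds of vectors are combined, so the averaging and concatenation used here are assumptions, not the thesis's method. The function names and the tiny hand-written vector tables are hypothetical stand-ins for vectors pre-trained with word2vec (skip-gram), and the input is assumed to be already segmented into words.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(tokens: Sequence[str], table: Dict[str, Vector], dim: int) -> Vector:
    """Average the vectors of all tokens found in `table`.

    Tokens without a vector are skipped; if none are found, a zero
    vector of length `dim` is returned.
    """
    found = [table[t] for t in tokens if t in table]
    if not found:
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]


def combined_representation(
    words: Sequence[str],
    word_vecs: Dict[str, Vector],
    char_vecs: Dict[str, Vector],
    word_dim: int,
    char_dim: int,
) -> Vector:
    """Concatenate the averaged word vector with the averaged character vector.

    Assumption: concatenation is only one plausible way to "combine"
    the two representations.
    """
    chars = [ch for word in words for ch in word]
    word_part = mean_vector(words, word_vecs, word_dim)
    char_part = mean_vector(chars, char_vecs, char_dim)
    return word_part + char_part


# Toy vectors standing in for pre-trained skip-gram vectors (made-up values).
word_vecs = {"文本": [1.0, 0.0], "分类": [0.0, 1.0]}
char_vecs = {
    "文": [1.0, 1.0, 0.0],
    "本": [0.0, 1.0, 0.0],
    "分": [0.0, 0.0, 1.0],
    "类": [1.0, 0.0, 1.0],
}

doc = ["文本", "分类"]  # an already segmented document
vec = combined_representation(doc, word_vecs, char_vecs, word_dim=2, char_dim=3)
print(len(vec), vec)
```

The resulting vector has one part from word space and one from character space, so a downstream classifier sees both levels of semantic unit at once. In a neural network the same idea would instead feed two separate embedding layers, as the abstract describes.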
Keywords/Search Tags: Text classification, text representation, word embedding, character embedding, hybrid network