
Research Of Text Classification Based On Word2vec And Self-attention

Posted on: 2020-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: B J Zuo
Full Text: PDF
GTID: 2428330596495384
Subject: Control engineering
Abstract/Summary:
As one of the basic tasks of natural language processing, text classification has been widely used in sentiment analysis, news classification, and other fields. Text representation and feature extraction are two important factors affecting the performance of text classification, and they determine the upper limit of the classification effect. Nowadays, text representations are usually trained on large external corpora, but this makes it difficult to solve the out-of-vocabulary (OOV) problem. In the feature extraction field, models based on convolutional neural networks (CNN) or recurrent neural networks (RNN) are usually used to extract text features automatically, yet these model structures may lose some text information during training. Therefore, obtaining richer semantic information in text representation and constructing a model that can fully extract text features have become difficult and popular topics in text classification research. In view of these two problems, the research work is as follows.

Firstly, the CP_word2vec method for word vector initialization is proposed based on the word2vec model. This method effectively handles unknown words in the training set: it prevents the damage to the word vector space caused by too many randomly initialized vectors, reduces the influence of external interference such as word segmentation errors and spelling errors, and provides richer semantic information for the subsequent feature extraction stage.

Secondly, a hierarchical neural network model, HTN, is proposed based on the Transformer model. Through its self-attention mechanism, the Transformer can consider the relationship between every pair of words in a document; compared with CNN and RNN, it can extract more features and information. This paper fully considers the hierarchy of the document structure and builds models at both the sentence level and the document level, so that the model can extract text information from the word level to the sentence level, and then to the document level.

Then, combining the CP_word2vec and HTN models, a new model, CPW_HTN, is proposed, which combines the advantages of the two methods and significantly improves the effect of text classification.

Finally, this paper first evaluates the CP_word2vec method experimentally on two sentiment analysis datasets. The results show that, under the same conditions, CP_word2vec outperforms word2vec on both datasets. The CPW_HTN model is then analyzed on two news datasets, with seven deep learning models selected for comparison. The results show that, compared with the other deep learning models, the CPW_HTN model proposed in this paper achieves the best classification accuracy.

To sum up, this paper improves the initial word vectors and constructs a hierarchical deep learning model to retain and extract the effective information of the text more fully and further improve the accuracy of text classification.
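The abstract does not give the construction details of CP_word2vec. As one illustration of the underlying idea (avoiding many independent random vectors for OOV words), the sketch below backs off to the mean of pretrained character vectors before falling back to small-variance noise; the character back-off strategy and all function names here are assumptions for illustration, not the thesis's actual method.

```python
import numpy as np

def init_embeddings(vocab, pretrained, dim=100, seed=0):
    """Build an embedding matrix with a gentler fallback for OOV words.

    Rather than giving every OOV word an independent random vector
    (which can scatter the embedding space), an OOV word whose
    characters appear in the pretrained vectors is initialized as the
    mean of those character vectors; only fully unknown words receive
    small-variance random noise.
    """
    rng = np.random.default_rng(seed)
    emb = np.zeros((len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            emb[i] = pretrained[word]          # known word: use as-is
        else:
            char_vecs = [pretrained[c] for c in word if c in pretrained]
            if char_vecs:
                emb[i] = np.mean(char_vecs, axis=0)   # compose from characters
            else:
                emb[i] = rng.normal(0.0, 0.01, dim)   # last resort: tiny noise
    return emb
```

Keeping the random fallback near zero limits how much truly unknown words distort distances in the pretrained vector space.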
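The hierarchical self-attention idea behind HTN can be sketched minimally: self-attention lets every position attend to every other position directly, and the two-level composition encodes sentences first, then attends over sentence vectors. This sketch omits the learned Q/K/V projections, multi-head attention, and feed-forward layers of a real Transformer, and mean-pooling is an assumed aggregation choice, not necessarily the thesis's.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence X of shape
    (seq_len, d). Each output row is a softmax-weighted mix of all
    rows, so every word directly attends to every other word."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                  # pairwise similarity scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ X

def hierarchical_encode(doc):
    """doc: list of sentences, each an array of word vectors (len_i, d).
    Encode each sentence with self-attention and mean-pool it, then
    apply self-attention over the sentence vectors and pool again,
    yielding one document vector (word -> sentence -> document)."""
    sent_vecs = np.stack([self_attention(s).mean(axis=0) for s in doc])
    return self_attention(sent_vecs).mean(axis=0)
```

Stacking the two levels means attention weights are computed among words within a sentence and separately among sentences within the document, matching the hierarchical structure described above.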
Keywords/Search Tags: Text classification, Text representation, Word2vec model, Text feature extraction, Self-attention mechanism