
Research On Text Representation And Classification Based On Neural Networks And Self-attention Mechanism

Posted on: 2021-04-07  Degree: Master  Type: Thesis
Country: China  Candidate: H Zhu  Full Text: PDF
GTID: 2428330611964272  Subject: Computer application technology
Abstract/Summary:
In the Internet era, the explosive growth of information has led people to pay increasing attention to the potential value of massive textual data. Using or improving existing artificial intelligence techniques to mine hidden information from massive text data is a current research focus, and difficulty, in the field of Natural Language Processing. Text classification involves several issues, such as understanding natural text content and pattern classification, and a text representation method with good training performance is a prerequisite for text classification tasks. From One-Hot encoding to distributed representations and on to neural-network Pre-trained Language Models, text representation methods have developed considerably, laying a solid foundation for a large number of practical Natural Language Processing tasks.

However, text representation models built around RNNs (Recurrent Neural Networks) are usually biased models: the semantic information they capture is imbalanced, and they cannot be computed in parallel. Text representation models built around CNNs (Convolutional Neural Networks) can be computed in parallel, but they cannot capture long-distance dependencies between words. Recently popular neural-network Pre-trained Language Models, such as Google's BERT, consider the overall context of a document and address the weak feature extraction of earlier text representations, which cannot learn the linguistic knowledge contained in large amounts of unsupervised data. However, Pre-trained Language Models also suffer from very large parameter counts and long pre-training times. To better complete the task of text classification, this thesis starts from these problems, carries out a series of studies on text representation models, and builds text classification models on top of them. The work of this thesis mainly covers the following three aspects:

(1) Aiming at the insufficient feature extraction and lack of category information in traditional text representation and classification models, this thesis proposes LTCW_CNN, a text representation and classification model based on fused features and a multi-channel CNN. First, a class-probability-variance CTF-IDF algorithm is proposed, which introduces the category probability information of the text, enriches the feature representation, and compensates for the inability of the traditional TF-IDF algorithm to extract category information. Second, a word embedding model with category and word-frequency information, CT_Word2vec, is proposed: the CTF-IDF algorithm computes word weights, which are then used to weight the word vectors produced by Word2vec (a minimal sketch of this weighting is given below). Then, single text representation models such as CT_Word2vec, TF-IDF_VSM and LSI are fused to build a new text representation model, LTCW. Finally, the text vectors produced by LTCW are fed into a multi-channel CNN to fully extract text features and perform classification. Experiments on two datasets, Fudan News Text and NetEase News, show that LTCW_CNN outperforms the baseline models, with F1 scores of 97.01% and 96.28%, respectively.
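The abstract does not give the CTF-IDF formula, so the weighting step can only be illustrated as an assumption-laden sketch: below, a classic TF-IDF weight is scaled by the variance of the word's class-probability distribution and then applied to its Word2vec vector. The function names (ctf_idf_weight, ct_word2vec) and the exact way the terms are combined are illustrative, not the thesis' definition.

```python
# Minimal, assumed sketch of CTF-IDF weighting applied to Word2vec vectors.
# The exact CTF-IDF formula is not stated in the abstract; here the tf-idf
# weight is scaled by the variance of P(class | word), so that words
# concentrated in a few categories receive larger weights.
import math
import numpy as np

def ctf_idf_weight(tf, df, n_docs, class_probs):
    """tf: term frequency in the document, df: document frequency,
    n_docs: corpus size, class_probs: P(class | word) over all classes."""
    idf = math.log(n_docs / (1 + df))
    class_var = float(np.var(class_probs))      # category-information term
    return tf * idf * (1.0 + class_var)         # assumed combination

def ct_word2vec(tokens, embeddings, weights):
    """Average the CTF-IDF-weighted Word2vec vectors of a document.
    embeddings: dict word -> np.ndarray, weights: dict word -> CTF-IDF weight."""
    vecs = [weights[w] * embeddings[w]
            for w in tokens if w in embeddings and w in weights]
    if not vecs:
        return np.zeros(len(next(iter(embeddings.values()))))
    return np.mean(vecs, axis=0)
```

In the full model, such document vectors would presumably be combined with the TF-IDF_VSM and LSI representations to form the LTCW input channels of the multi-channel CNN; the abstract does not specify the fusion mechanism.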
(2) Aiming at the problem that BERT's pre-training method does not fully consider inter-word and inter-sentence information, this thesis proposes PreBERT, a model based on continuous word masking and an above-and-next-sentence prediction task. First, a Continuous Masking Language Model (CMLM) based on the CoMASK method is proposed: CMLM masks randomly selected words together with their adjacent words at a certain ratio, which alleviates the problem that masking only single, randomly chosen words ignores the continuity and dependency information between words, and allows word-level information to be learned more fully (a sketch of this masking idea is given below). Second, BERT's Next Sentence Prediction (NSP) task is improved into an Above and Next Sentence Prediction (ANSP) task, which considers the context on both sides of a sentence and extracts the relevant information between sentence pairs more fully. By integrating the improved CMLM and ANSP pre-training tasks, PreBERT achieves better results on both single-sentence classification and sentence-pair classification. On the Fudan News Text, NetEase News, BQ and LCQMC datasets, accuracy improves over the BERT model by 0.22%, 0.16%, 2.17% and 1.27%, respectively.

(3) Pre-trained Language Models such as BERT rely on absolute position embeddings, which lose the ordering and dependency information of element pairs; they also have large parameter counts, are difficult to adjust structurally, and require long pre-training. Inspired by the Multi-Head Self-Attention mechanism, this thesis proposes PMSAN, a multi-layer Multi-Head Self-Attention text representation and classification model based on relative position embedding. PMSAN's multi-layer, multi-head structure captures the semantic information inside a sentence at multiple scales, and relative position embedding adds feature word-pair information when computing the Multi-Head Attention parameter matrices. Compared with the traditional position embedding method, this approach introduces ordering information while using fewer parameters than traditional Pre-trained Language Models (a sketch of relative-position attention is given below). Experimental results show that PMSAN achieves good results on ten authoritative Chinese and English datasets at a small cost: accuracies of 49.1%, 84.1%, 84.0%, 61.9%, 69.5%, 72.5%, 93.2% and 98.2% on eight English datasets, and 98.4% and 97.3% on two Chinese datasets. This demonstrates that PMSAN provides stronger semantic representation with higher efficiency.
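For the continuous masking idea in (2), a minimal sketch is given below. The 15% masking budget, the one-word window around each selected position and the function name continuous_mask are assumptions for illustration; the abstract only states that selected words and their adjacent words are masked at a certain ratio.

```python
# Illustrative sketch of span-style continuous masking (CoMASK-like).
# Assumptions: a 15% masking budget (as in BERT) and a fixed window of
# adjacent tokens around each selected word; the thesis' exact ratio and
# window size are not given in the abstract.
import random

MASK = "[MASK]"

def continuous_mask(tokens, mask_ratio=0.15, window=1, seed=None):
    if not tokens:
        return list(tokens), {}
    rng = random.Random(seed)
    tokens = list(tokens)
    budget = max(1, int(len(tokens) * mask_ratio))
    masked = set()
    while len(masked) < budget:
        center = rng.randrange(len(tokens))
        # mask the chosen word together with its adjacent words
        for i in range(max(0, center - window),
                       min(len(tokens), center + window + 1)):
            masked.add(i)
    labels = {i: tokens[i] for i in masked}       # original tokens to predict
    corrupted = [MASK if i in masked else t for i, t in enumerate(tokens)]
    return corrupted, labels
```

The masked positions and their original tokens would then be fed to the masked-language-model head as prediction targets, exactly as in BERT's MLM objective.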
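For the relative position embedding in (3), the sketch below follows the widely used formulation of Shaw et al. (2018), in which learned relative-position key embeddings are added to the attention scores. Whether PMSAN uses exactly this form is an assumption, and all variable names are illustrative; only a single attention head is shown.

```python
# Illustrative single-head self-attention with relative position embeddings,
# in the spirit of Shaw et al. (2018). PMSAN's exact formulation is not
# given in the abstract; this only shows how relative distances can replace
# absolute position embeddings in the attention score.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_self_attention(x, Wq, Wk, Wv, rel_k, max_dist=4):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head);
    rel_k: (2*max_dist + 1, d_head) learned relative-position key embeddings."""
    n, _ = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv                       # (n, d_head) each
    d_head = q.shape[-1]
    # clipped relative distances j - i, mapped into [0, 2*max_dist]
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                  -max_dist, max_dist) + max_dist           # (n, n)
    content = q @ k.T                                       # content-content term
    position = np.einsum("id,ijd->ij", q, rel_k[idx])       # content-position term
    attn = softmax((content + position) / np.sqrt(d_head))
    return attn @ v                                         # (n, d_head)
```

Because the position term depends only on clipped pairwise distances, the number of position parameters is fixed at 2*max_dist + 1 vectors per head, which is far smaller than a full absolute position embedding table.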
Keywords/Search Tags:text representation, text classification, BERT, neural network, self-attention mechanism