
Bolstering CNN With Self-attention For Sentiment Analysis Research Of Multi-Language Mixed Text

Posted on: 2022-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Zhu
Full Text: PDF
GTID: 2518306335497604
Subject: Applied Statistics
Abstract/Summary:
Sentiment analysis of multilingual mixed text aims to identify and classify the opinions, attitudes, and emotions that people express on social media. Multilingual mixed texts are common in Europe, Africa, Southeast Asia, and other regions, where users blend the local native language with a dominant common language (usually English). This mixed mode of communication is easy and convenient and lowers the burden on people's language knowledge, but it is quite difficult for non-native speakers to understand. Because multilingual mixed text does not follow formal grammatical rules, transliteration rules, or regular sentence structure, sentiment analysis methods developed for general (mainly English) domains cannot be applied directly to it. In addition, multilingual mixed-text datasets are generally small, sparse, and noisy, which poses new challenges for sentiment analysis. Designing and implementing a model that can automatically recognize the sentiment polarity implicit in multilingual mixed text therefore has significant practical value.

In view of the above, this thesis proposes a self-attention enhancement model for sentiment analysis of multilingual mixed text. The work consists of two parts.

The first part is a self-attention model based on XLM-RoBERTa for the sentiment analysis task. The preprocessed text is first fed into the XLM-RoBERTa pre-training module. In addition to taking XLM-RoBERTa's original output, the model passes the last hidden state of XLM-RoBERTa into a BiLSTM and assigns self-attention weights to the BiLSTM hidden-layer outputs. Finally, the original XLM-RoBERTa output vector is concatenated with the self-attention-weighted representation to predict the sentiment polarity of the classification task. The motivation is that the original output of XLM-RoBERTa usually cannot fully summarize the semantic content of the input, while rich semantic features can be learned from its top hidden layer (also called the semantic layer); the self-attention weighting accounts for the presence of multiple sentiment-bearing units, and the BiLSTM captures long-range dependencies between bilingual word sequences and character sequences.
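To make the first part concrete, the following is a minimal PyTorch sketch of the described architecture, assuming the HuggingFace transformers library; the hidden sizes, label count, and all class and attribute names are illustrative assumptions, not the thesis's exact configuration.

```python
# Minimal sketch (assumed names/dimensions): XLM-RoBERTa pooled output
# concatenated with a self-attention-weighted BiLSTM summary of the
# encoder's last hidden state.
import torch
import torch.nn as nn
from transformers import XLMRobertaModel

class XLMRSelfAttnClassifier(nn.Module):
    def __init__(self, lstm_hidden=256, num_labels=3):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        enc_dim = self.encoder.config.hidden_size          # 768 for the base model
        self.bilstm = nn.LSTM(enc_dim, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Additive self-attention over BiLSTM states: one scalar score per token.
        self.attn_proj = nn.Linear(2 * lstm_hidden, 2 * lstm_hidden)
        self.attn_vec = nn.Linear(2 * lstm_hidden, 1, bias=False)
        self.classifier = nn.Linear(enc_dim + 2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state                     # (B, T, enc_dim)
        pooled = out.pooler_output                         # (B, enc_dim)
        lstm_out, _ = self.bilstm(hidden)                  # (B, T, 2*lstm_hidden)
        scores = self.attn_vec(torch.tanh(self.attn_proj(lstm_out)))  # (B, T, 1)
        scores = scores.masked_fill(attention_mask.unsqueeze(-1) == 0, -1e9)
        weights = torch.softmax(scores, dim=1)             # attention over tokens
        attended = (weights * lstm_out).sum(dim=1)         # (B, 2*lstm_hidden)
        return self.classifier(torch.cat([pooled, attended], dim=-1))
```

Concatenating the pooled output with the attended BiLSTM summary lets the classifier see both the encoder's sentence-level view and a token-weighted view that can emphasize several sentiment-bearing units at once.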
The second part is an ensemble framework for mixed-language sentiment analysis based on a self-attention-enhanced CNN, which improves on the design of the first part. The output of XLM-RoBERTa is fed into an integrated model of self-attention-based BiLSTM and CNN; subword embedding makes full use of the intermediate features produced by the convolution operation, and a vector gating mechanism combines the character and word representations of the subword embedding with the output of bilingual pre-trained word vectors to train the model. The CNN considers, to some extent, the order of words and the context in which they appear, and using several filter sizes captures contexts of different lengths. Experimentally, the CNN model performs well on both positive and negative tweets, while the self-attention model performs better on neutral tweets but worse on positive and negative samples; the two are complementary, which is the main reason for choosing an ensemble.

The model designed in this thesis achieves the best results on the two public datasets Dravidian and SentiMix, reaching F1 scores of 0.862, 0.846, 0.776, and 0.77 on Hindi-English, Spanish-English, Malayalam-English, and Tamil-English respectively, the best scores of existing models on these two datasets.
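As an illustration of the vector gating and multi-filter CNN described in the second part, here is a minimal sketch under assumed dimensions; the gate form (a per-dimension sigmoid mixing of the two embeddings), filter sizes, and all names are illustrative assumptions rather than the thesis's exact design.

```python
# Minimal sketch (assumed names/dimensions): a vector gate blends subword
# embeddings with pretrained bilingual word embeddings, then parallel
# convolutions of several widths capture contexts of different lengths.
import torch
import torch.nn as nn

class GatedCNNEncoder(nn.Module):
    def __init__(self, emb_dim=300, filter_sizes=(2, 3, 4), n_filters=100,
                 num_labels=3):
        super().__init__()
        # Vector gate: decides, per dimension, how much to trust the subword
        # embedding versus the pretrained word embedding.
        self.gate = nn.Linear(2 * emb_dim, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in filter_sizes)
        self.classifier = nn.Linear(n_filters * len(filter_sizes), num_labels)

    def forward(self, subword_emb, word_emb):
        # Both inputs: (B, T, emb_dim), computed by upstream embedding layers.
        g = torch.sigmoid(self.gate(torch.cat([subword_emb, word_emb], dim=-1)))
        mixed = g * subword_emb + (1.0 - g) * word_emb     # (B, T, emb_dim)
        x = mixed.transpose(1, 2)                          # (B, emb_dim, T)
        # Max-pool each filter width over time, then concatenate.
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(feats, dim=-1))
```

Because the gate is computed per dimension, such a model can lean on subword evidence for rare or romanized code-mixed tokens while keeping pretrained word semantics where they are reliable.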
Keywords/Search Tags: Deep learning, Sentiment analysis, Code mixing, Self-attention mechanism, Subword embedding