
Research On Korean Text Representation And Sentiment Analysis Based On Deep Neural Network

Posted on: 2022-01-02    Degree: Doctor    Type: Dissertation
Country: China    Candidate: G Z Jin    Full Text: PDF
GTID: 1488306728482374    Subject: Computer application technology
Abstract/Summary:
With the continuous development and improvement of deep learning theory, deep learning-based models have become the mainstream approach to natural language processing (NLP) in recent years. Representation learning is the foundation of deep learning-based NLP, and its quality directly determines the performance of downstream tasks. The scarcity of Korean corpora and the agglutinative nature of Korean make research on Korean NLP particularly difficult. This dissertation addresses fundamental problems in Korean NLP, namely Korean word vectors, automatic word spacing, morphological analysis and part-of-speech tagging, named entity recognition, and sentiment analysis, from the perspectives of representation learning and model structure. These tasks typically underlie more complex NLP applications such as machine translation, reading comprehension, and dialogue systems, so this research can inform subsequent work on Korean NLP. The main contributions of this dissertation are summarized as follows:

1) By analyzing the structure of existing word embedding models, we identify a word embedding model suited to Korean NLP. First, we compare and analyze existing word embedding models, including NPLM, the CBOW and Skip-gram models of Word2vec, GloVe, fastText, and Swivel, and explain the relationships between these models, assessing their applicability to Korean word embedding in light of the linguistic characteristics of Korean. Second, we evaluate the quality of these embeddings for Korean on a word-similarity evaluation dataset and measure their downstream performance on the Naver movie review sentiment analysis dataset. Finally, the theoretical and experimental analysis shows that the fastText model has clear advantages for Korean.

2) We propose a new method for automatic Korean word spacing. Automatic word spacing, which is analogous to the Chinese word segmentation problem, is a fundamental problem in Korean NLP. First, to overcome the dependence of traditional methods on manually extracted features, we propose KWSE, a Korean spacing-enhanced character embedding model, which yields character embeddings that contain both semantic and spacing-polarity information. Second, we combine these embeddings with an LSTM-CRF to perform the spacing task. Our method achieves a 92.86% F1-score, outperforming other methods.

3) For Korean part-of-speech (POS) tagging, we propose a pipeline model and an end-to-end hierarchical sequence-to-sequence model. POS tagging is a basic NLP task; unlike English POS tagging, Korean POS tagging consists of a morphological analysis step followed by a tagging step. Accordingly, we propose two solutions. The first is a pipeline method that recovers morpheme base forms with a sequence-to-sequence (seq2seq) model and then performs tagging with a Bi-LSTM-CRF model. The second is an end-to-end seq2seq approach. Existing seq2seq-based methods usually model the context of the entire sentence, but Korean morphological analysis relies more on local contextual information, and in many cases there is a one-to-one match between a morpheme's surface form and its base form. To exploit these properties, we propose a hierarchical seq2seq model: a low-level Bi-LSTM encodes the syllable sequence of each Eojeol, a high-level Bi-LSTM incorporates Eojeol-level context into the Eojeol representation, and the decoder generates the morpheme base forms for every Eojeol. In addition, to improve base-form recovery, we fuse local n-gram information from the morpheme surface form into the model through a convolution layer and an attention mechanism. Experiments on the Sejong corpus show that our model outperforms strong baseline systems in both morpheme-level F1 and Eojeol-level accuracy, achieving state-of-the-art results.

4) We propose a Korean named entity recognition (NER) method using a Bi-LSTM-CRF and masked self-attention. NER is a fundamental NLP task; existing Korean NER methods use morphemes, syllable sequences, and POS tags as features within a sequence labeling model. In Korean, on one hand, the morpheme itself carries strong indicative information about named entities (especially times and person names); on the other hand, the context of the target morpheme plays an important role in predicting its named entity (NE) tag. To make full use of both signals, we propose two auxiliary tasks: a morpheme-level NE tagging task that captures the NE features of the syllable sequence composing a morpheme, and a context-based NE tagging task that captures the context of the target morpheme through a masked self-attention network. Both tasks are trained jointly with the Bi-LSTM-CRF NER tagger. Experiments on the Klpexpo 2016 corpus and the Naver NLP Challenge 2018 corpus show that our model outperforms strong baseline systems and achieves state-of-the-art results.

5) We propose a Korean sentiment analysis method based on sentiment-enhanced morpheme embeddings. The sentiment polarity of words is a key factor in sentence-level sentiment analysis, yet existing sentence-level methods focus on modeling the word sequence while ignoring the sentiment information of the word vectors themselves: the initial morpheme vectors fed to a Bi-LSTM model usually contain semantic information but lack sentiment information. To address this, we propose a sentiment analysis method that combines sentiment-enhanced morpheme embeddings with a Bi-LSTM and apply it to Korean sentiment analysis. Experiments show that the Bi-LSTM model with sentiment-enhanced morpheme embeddings outperforms the same model with conventional word vectors. For the Korean aspect-based sentiment analysis (ABSA) problem, we propose a convolutional neural network with an attention-pooling mechanism; experimental results show that this method also outperforms the baseline models.
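The advantage of fastText for agglutinative Korean comes from its subword model: a word vector is the sum of vectors for its character n-grams, so inflected Eojeols that share a stem also share most of their n-grams. The abstract does not give implementation details; the following is a minimal sketch of the standard fastText n-gram decomposition (boundary markers `<` and `>`, n-gram range as parameters), not code from the dissertation.

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Enumerate fastText-style character n-grams of a word.

    The word is padded with boundary markers so that prefixes and
    suffixes (e.g. Korean particles) become distinct features.
    """
    padded = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

# The Eojeol "학교에" ("to school") shares stem n-grams with "학교" itself,
# which is why subword vectors generalize across Korean inflections.
print(char_ngrams("학교에"))
```

In the full model, each n-gram has its own embedding and the word vector is their sum; the decomposition above only shows which features contribute.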
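Automatic word spacing as described in contribution 2) is a character-level sequence labeling problem: each syllable gets a tag indicating whether it starts a new Eojeol, and the LSTM-CRF predicts that tag sequence. The sketch below shows only the data representation (the B/I encoding and its inverse), under the assumption of a simple two-tag scheme; the KWSE embedding and the LSTM-CRF model themselves are not reproduced here.

```python
def to_labels(sentence):
    """Convert a correctly spaced sentence into (syllables, tags).

    Tag 'B' marks a syllable that begins a new Eojeol (word),
    'I' marks a syllable that continues the current Eojeol.
    """
    chars, tags = [], []
    for word in sentence.split():
        for i, ch in enumerate(word):
            chars.append(ch)
            tags.append("B" if i == 0 else "I")
    return chars, tags

def restore_spacing(chars, tags):
    """Invert to_labels: insert a space before every 'B' syllable."""
    out = []
    for ch, tag in zip(chars, tags):
        if tag == "B" and out:
            out.append(" ")
        out.append(ch)
    return "".join(out)
```

A sequence labeler trained on pairs produced by `to_labels` can then space raw, unspaced input by predicting tags and calling `restore_spacing`.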
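The context-based NE tagging task in contribution 4) relies on masked self-attention: the target morpheme's own position is masked out of the attention scores, so its resulting representation is built purely from surrounding morphemes. The NumPy sketch below illustrates that masking idea in its simplest single-head, unparameterized form (queries, keys, and values all equal to the input); the dissertation's actual network is not specified here and will differ.

```python
import numpy as np

def masked_self_attention(X):
    """Context-only self-attention over a sequence of vectors X (n, d).

    The diagonal of the score matrix is set to -inf, so position i
    cannot attend to itself: its output is a weighted average of the
    OTHER positions only, i.e. a pure context representation.
    """
    d = X.shape[-1]
    scores = (X @ X.T) / np.sqrt(d)
    np.fill_diagonal(scores, -np.inf)          # mask the target position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X
```

With only two positions the behavior is easy to check: each position's output must equal the other position's vector, since that is the only context available.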
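Contribution 5) enriches morpheme vectors with sentiment information before they reach the Bi-LSTM. The abstract does not say how the sentiment signal is fused with the semantic vector; one common and simple choice, shown below purely as an assumption, is to concatenate a lexicon-derived polarity feature onto the semantic embedding. The lexicon here is a hypothetical toy example, not a resource from the dissertation.

```python
import numpy as np

# Hypothetical toy polarity lexicon: morpheme -> polarity in [-1, 1].
SENTIMENT_LEXICON = {"좋": 1.0, "나쁘": -1.0}

def enhance(morpheme, semantic_vec, lexicon=SENTIMENT_LEXICON):
    """Append a sentiment-polarity feature to a semantic embedding.

    Morphemes absent from the lexicon get a neutral polarity of 0.0,
    so the Bi-LSTM input carries both semantic and sentiment signals.
    """
    polarity = lexicon.get(morpheme, 0.0)
    return np.concatenate([semantic_vec, [polarity]])
```

In the dissertation's setting the sentiment signal is learned into the embedding itself rather than looked up at input time; this sketch only shows why a sentiment-aware input vector gives the downstream model information that a purely semantic vector lacks.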
Keywords/Search Tags:Korean, Natural language processing, Deep learning, Word embedding, Part-of-speech tagging, Named entity recognition, Sentiment analysis