Research On Thai Word Segmentation And Entity Extraction Technology

Posted on: 2022-08-28
Degree: Master
Type: Thesis
Country: China
Candidate: H W Wu
Full Text: PDF
GTID: 2518306554971469
Subject: Computer technology
Abstract/Summary:
Word segmentation and entity extraction are important tasks in natural language processing. For high-resource languages, performance on both tasks has improved greatly, but for Thai, which lacks corpus resources, a single neural network method can hardly realize its full potential. Thai word formation resembles that of Chinese: there is no explicit separator between words, so word boundaries cannot be read directly from the text. Taking these characteristics of Thai word formation into account, this thesis studies Thai word segmentation and entity extraction, as detailed below.

To address the low word segmentation performance for Thai as a low-resource language, we propose and implement a sequence-to-sequence Thai word segmentation model, Glove-Seq2Seq. The model uses bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks to transform an input sequence into an output sequence. We compare it with multiple word segmentation models on datasets from four different domains. The experiments show that the model is simple and effective and has strong domain applicability: even with limited data resources, it still achieves the desired effect.

For the extraction of entities such as Thai person names, place names, and organization names, we divide the problem into two tasks: entity recognition and entity classification. For entity recognition, we build a model based on a label attention network with layer-by-layer improvement. The model has two layers, each containing a BiGRU layer that encodes sequence information and a label attention inference layer that infers label information. Comparisons with many commonly used entity recognition models verify the superiority of this model, which is better suited to Thai entity recognition.

For the Thai entity classification task, we construct an attention-enhanced BiLSTM neural network classification model, which combines the BiLSTM network with an attention mechanism. Experiments on multiple groups of entity classification tasks confirm that the proposed method effectively improves the entity classification performance of BiLSTM. Moreover, the model's performance improves as the number of classification tasks increases. In particular, for entity types with richer data, the F1 score reaches 94.76%, which achieves the ideal classification effect.
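
As an illustration of why segmentation of unspaced text can be treated as transforming one sequence into another, the sketch below converts a segmented sentence into a character sequence with one boundary tag per character, and back again. The B/I tag scheme, the function names `words_to_tags` and `tags_to_words`, and the example sentence are assumptions made for illustration; the abstract does not state which input and output encoding the Glove-Seq2Seq model uses.

```python
def words_to_tags(words):
    """Flatten a list of words into characters plus one tag per character.

    Assumption (not from the thesis): 'B' marks the first character of a
    word and 'I' marks every following character of the same word.
    """
    chars, tags = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            tags.append("B" if i == 0 else "I")
    return chars, tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/I tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            # Start a new word at every 'B' (and at the very first character).
            words.append(ch)
        else:
            # Any other tag continues the current word.
            words[-1] += ch
    return words


if __name__ == "__main__":
    # Hypothetical example sentence, already segmented by hand.
    segmented = ["ฉัน", "รัก", "ภาษา", "ไทย"]
    chars, tags = words_to_tags(segmented)
    print("".join(chars))  # the unsegmented text a model would receive
    print(tags)            # the target sequence a model would predict
    print(tags_to_words(chars, tags))
```

Under this framing, a model only has to predict a tag sequence of the same length as the input, and word boundaries are recovered by `tags_to_words`. Note that Python iterates over Unicode code points, so Thai vowel and tone marks are tagged as separate characters.
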
Keywords/Search Tags: Word segmentation, Entity recognition, Entity classification, Thai