Research On Tibetan Word Segmentation Algorithm Based On Deep Neural Network

Posted on: 2022-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: R J Lu
Full Text: PDF
GTID: 2518306764480424
Subject: Automation Technology
Abstract/Summary:
Tibetan word segmentation is the foundation of, and a prerequisite for, Tibetan natural language processing. Mainstream Tibetan word segmentation models are now based on deep neural networks. Accordingly, this dissertation studies Tibetan word segmentation algorithms based on deep neural networks, and designs and implements a Tibetan word segmentation system. The main work is summarized as follows:

(1) Construction of the Tibetan dataset. A Tibetan corpus is collected and preprocessed. A Tibetan word segmentation model is built using the dataset provided by Tibet University, and the corpus is pre-segmented with this model. After verification of the pre-segmented output, a dataset for the Tibetan word segmentation task is obtained. The dataset is then deduplicated with simhash, yielding a final dataset of 74,384 Tibetan sentences.

(2) Improvement of the Tibetan word segmentation method based on a long short-term memory (LSTM) network and a conditional random field (CRF). An improved LSTM-CRF Tibetan word segmentation model is proposed. A Soft Attention mechanism is applied to improve the model's ability to extract context information from Tibetan text sequences, and a syllable expansion method is applied to compensate for the weak feature information in the input corpus. The experimental results show that, compared with the LSTM-CRF Tibetan word segmentation model, the precision, recall and F1 of the Soft Attention LSTM-CRF model on the Tibet-News dataset increase by 2.9%, 3.5% and 3.2%, respectively.

(3) Improvement of the Tibetan word segmentation method based on Transformer and a conditional random field (CRF). A Tibetan word segmentation model based on Transformer and CRF is proposed, and the self-attention mechanism is modified to reduce the training time of the model. The experimental results show that, compared with the LSTM-CRF Tibetan word segmentation model, the precision, recall and F1 of the Transformer-CRF model on the Tibetan-News dataset improve by 3.4%, 3.4% and 3.5%, respectively.

(4) Design and implementation of the Tibetan word segmentation system. Based on the above models, a Tibetan word segmentation system is designed and implemented following software engineering principles. It provides registration, login, user information management, Tibetan word segmentation, and file management functions, so that a user can enter Tibetan text and obtain the word segmentation result.
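The simhash deduplication step mentioned in item (1) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual configuration, so everything below is an assumption made for illustration: the 64-bit fingerprint width, the use of MD5 as the per-token hash, splitting sentences into syllables on the Tibetan tsheg mark (U+0F0B), the Hamming-distance threshold of 3, and the helper names `simhash`, `hamming`, `syllables` and `deduplicate`.

```python
import hashlib
from typing import Iterable, List

BITS = 64          # fingerprint width (assumed, not stated in the abstract)
TSHEG = "\u0f0b"   # Tibetan syllable delimiter


def simhash(tokens: Iterable[str]) -> int:
    """Return a 64-bit SimHash fingerprint for a sequence of tokens."""
    weights = [0] * BITS
    for tok in tokens:
        # Stable 64-bit hash per token (MD5 chosen for determinism, not security).
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the tokens voted for 1 more often than 0.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def syllables(sentence: str) -> List[str]:
    """Split a Tibetan sentence into syllables on the tsheg mark."""
    return [s for s in sentence.split(TSHEG) if s]


def deduplicate(sentences: Iterable[str], max_distance: int = 3) -> List[str]:
    """Keep a sentence only if no already-kept sentence is within max_distance bits."""
    kept: List[str] = []
    fingerprints: List[int] = []
    for sentence in sentences:
        fp = simhash(syllables(sentence))
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(sentence)
            fingerprints.append(fp)
    return kept
```

This pairwise comparison is quadratic in the number of kept sentences. For a corpus of tens of thousands of sentences that is still workable, but larger collections usually index fingerprints by bit blocks so that only likely near-duplicates are compared.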
Keywords/Search Tags: Attention Mechanism, Long Short-Term Memory Network, Conditional Random Field, Transformer