A Study On Tibetan Named Entity Recognition Based On Pre-training Techniques

Posted on: 2024-03-12  Degree: Master  Type: Thesis
Country: China  Candidate: Z H Xu  Full Text: PDF
GTID: 2568307085970819  Subject: Computer application technology
Abstract/Summary:
With the rapid development of the internet and information technology, the demand for automatic processing and analysis of Tibetan text is increasing. However, Tibetan named entity recognition, an essential task in Tibetan information extraction, has received relatively little research attention. This work addresses the limitations of existing Tibetan named entity recognition methods and proposes a method based on pre-training techniques. A Tibetan Word2Vec model, a Tibetan ELMo pre-trained language model, and a Tibetan ALBERT pre-trained language model were designed and implemented, and these three pre-trained models were applied to the Tibetan named entity recognition task. Experimental results show that pre-training can effectively improve the performance of Tibetan named entity recognition. Among the three, the ALBERT-BiLSTM-CRF model shows the largest improvement, with an F1 score of 94.20, an increase of 7.61% over the baseline BiLSTM-CRF model.

Because the baseline BiLSTM-CRF model performs entity boundary delineation and entity category judgment simultaneously, computation is slow and resource consumption is high. The research proposes a Cascade technique that divides Tibetan named entity recognition into two subtasks (entity boundary delineation and entity category judgment) performed in stages, simplifying the model structure. Experimental results show that this method significantly reduces the training time of the Tibetan named entity recognition model: the Cascade-BiLSTM-CRF model reduces training time by 28.30% compared to the BiLSTM-CRF model. Combining the Cascade technique with pre-training, the Cascade-Word2Vec-BiLSTM-CRF, Cascade-ELMo-BiLSTM-CRF, and Cascade-ALBERT-BiLSTM-CRF models are designed and implemented, and the training time of the models with the Cascade technique is significantly lower than that of their counterparts without it.

To further improve the effectiveness of Tibetan named entity recognition, the research explores integrating word information into the process: automatic segmentation software is used to segment Tibetan text, Tibetan Word2Vec word vectors are trained, and these are combined with syllable vectors as Syllable-Word-Fusion (SWF) inputs to the BiLSTM-CRF model. Experimental results show that the SWF-BiLSTM-CRF model improves the F1 score by 1.45% compared to the BiLSTM-CRF model. The research also investigates the combined application of pre-training, Cascade, and SWF techniques to Tibetan named entity recognition, designing and implementing the Tibetan Cascade-SWF-Word2Vec-BiLSTM-CRF, Cascade-SWF-ELMo-BiLSTM-CRF, and Cascade-SWF-ALBERT-BiLSTM-CRF models. Experimental comparisons show that models combining these three techniques achieve better results in Tibetan named entity recognition while also reducing model training time.
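
The two data-preparation ideas in the abstract, the Cascade split and the Syllable-Word-Fusion input, can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The BIO tag scheme, the `PER`/`LOC` type names, the function names, and the use of plain concatenation for fusion are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

Vector = List[float]


def split_cascade_labels(tags: List[str]) -> Tuple[List[str], List[str]]:
    """Split joint tags such as "B-PER" into two label sets.

    Returns (boundary_tags, entity_types): one boundary tag per syllable
    (stage 1) and one type per detected entity (stage 2).
    Assumes a BIO scheme; the thesis's actual tag set is not stated.
    """
    boundary: List[str] = []
    entity_types: List[str] = []
    for tag in tags:
        if tag == "O":
            boundary.append("O")
            continue
        prefix, entity_type = tag.split("-", 1)
        boundary.append(prefix)
        if prefix == "B":  # record the type once, at the start of each entity
            entity_types.append(entity_type)
    return boundary, entity_types


def fuse_syllable_word(
    syllable_vecs: List[Vector],
    word_vecs: List[Vector],
    word_lengths: List[int],
) -> List[Vector]:
    """Concatenate each syllable vector with the vector of its containing word.

    word_lengths[k] is the number of syllables in word k, as produced by a
    segmenter. Concatenation is an assumed fusion operator.
    """
    if len(word_vecs) != len(word_lengths):
        raise ValueError("need one length per word vector")
    if sum(word_lengths) != len(syllable_vecs):
        raise ValueError("word lengths must cover every syllable exactly once")
    fused: List[Vector] = []
    i = 0
    for word_vec, n_syllables in zip(word_vecs, word_lengths):
        for _ in range(n_syllables):
            fused.append(syllable_vecs[i] + word_vec)  # list concatenation
            i += 1
    return fused


# Hypothetical five-syllable sentence with two entities.
boundary, entity_types = split_cascade_labels(["B-PER", "I-PER", "O", "B-LOC", "O"])

# Three syllables forming two words (2 syllables + 1 syllable), toy vectors.
fused = fuse_syllable_word(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0], [2.0]],
    [2, 1],
)
```

In this sketch the first stage only has to choose among the tags `B`, `I`, and `O` instead of one tag per boundary-type combination, which is one plausible reading of how the Cascade split simplifies the model. The fused vectors have the syllable dimension plus the word dimension (here 2 + 1 = 3).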
Keywords/Search Tags: Tibetan, named entity recognition, pre-training, Cascade, Syllable-Word Fusion