The Internet’s growing popularity has caused a rapid rise in the volume of text information. Natural language processing tasks such as proofreading, classification, translation, and information retrieval must therefore be performed more accurately and efficiently. The word is the smallest unit that expresses semantic information, so the word sequence obtained by segmentation carries semantic information. To help computers understand human language, text must first be segmented so that its important content can be extracted. Word segmentation is therefore a fundamental task in natural language processing. This thesis studies Tibetan word segmentation in terms of the design of the word-position tag set and the design and implementation of the word segmentation model. The main contents are as follows:

(1) An eight-word-tag labeling method
To learn more semantic information and improve word segmentation performance, an eight-word-tag set is obtained by expanding the four-word-tag set, and annotation rules for the eight-word-tag set are formulated. Tibetan word segmentation based on word-position tagging requires a large experimental corpus, and manual annotation of the corpus is inefficient and error-prone. Therefore, a labeling algorithm for the eight-word-tag set is designed.

(2) A BiLSTM_CRF Tibetan word segmentation model incorporating Attention
BiLSTM_CRF, which combines BiLSTM and CRF, can automatically capture contextual information and take the relationships between output tags into account, but it cannot highlight locally important information and it loses historical features over long sequences. To address these problems, this thesis proposes Attention_BiLSTM_CRF, a Tibetan word segmentation model that adds Attention to BiLSTM_CRF. The model first obtains global contextual information through BiLSTM, then uses Attention to strengthen locally important information and mitigate the loss of historical information. Finally, it uses CRF to learn the relationships between
tags so that illegal tag sequences are avoided.

(3) Experimental verification of the eight-word-tag labeling method and the Attention_BiLSTM_CRF Tibetan word segmentation model
To verify the effectiveness of the eight-word-tag labeling method and the Attention_BiLSTM_CRF Tibetan word segmentation model, experiments were conducted with the CRF, BiLSTM, BiLSTM_CRF, and Attention_BiLSTM_CRF Tibetan word segmentation models under the four-word-tag set, the six-word-tag set, and the eight-word-tag set, respectively. The experiments show that the Attention_BiLSTM_CRF model performs best when the Attention type is Sparse Self and the Attention layer is embedded after the forward LSTM and the backward LSTM.
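
To make the idea of word-position tagging concrete, the following minimal sketch shows how a segmented corpus can be converted into per-syllable tags and how illegal tag sequences can be detected. It is an illustration only and rests on several assumptions: the abstract does not state the eight-word-tag annotation rules, so the sketch uses the common four-tag set (B = begin, M = middle, E = end, S = single-unit word) that the eight-word-tag set is said to extend; the tagging unit is assumed to be the syllable; and the names `label_words`, `is_legal`, `decode`, and `LEGAL_NEXT` are hypothetical. The transition table is the kind of constraint a CRF layer is expected to learn from data, written out by hand here.

```python
from typing import Dict, List, Set, Tuple

# Assumed four-tag set: B = begin, M = middle, E = end, S = single-unit word.
# LEGAL_NEXT lists which tag may follow which (hand-written, not learned).
LEGAL_NEXT: Dict[str, Set[str]] = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}


def label_words(words: List[List[str]]) -> List[Tuple[str, str]]:
    """Convert segmented words (each a list of syllables) to (syllable, tag) pairs."""
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # skip empty entries
        if len(word) == 1:
            pairs.append((word[0], "S"))
        else:
            tags = ["B"] + ["M"] * (len(word) - 2) + ["E"]
            pairs.extend(zip(word, tags))
    return pairs


def is_legal(tags: List[str]) -> bool:
    """Check that a tag sequence starts, continues, and ends legally."""
    if not tags:
        return True
    if tags[0] not in {"B", "S"} or tags[-1] not in {"E", "S"}:
        return False
    return all(nxt in LEGAL_NEXT[cur] for cur, nxt in zip(tags, tags[1:]))


def decode(pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Rebuild words from (syllable, tag) pairs by cutting after E and S."""
    words: List[List[str]] = []
    current: List[str] = []
    for syllable, tag in pairs:
        current.append(syllable)
        if tag in {"E", "S"}:
            words.append(current)
            current = []
    if current:  # trailing syllables of an unfinished word
        words.append(current)
    return words


# Placeholder syllables stand in for Tibetan syllables.
words = [["s1"], ["s2", "s3"], ["s4", "s5", "s6"]]
pairs = label_words(words)
tags = [tag for _, tag in pairs]
```

In this sketch a sequence such as B followed by S is rejected by `is_legal`, which mirrors the role the abstract assigns to the CRF layer. A larger tag set would replace the tag-assignment rule in `label_words` and the entries of `LEGAL_NEXT`, while the conversion and decoding steps would keep the same shape.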