
Research On The Semantic Retrieval Method Of Tibetan Language Based On Neural Network Language Model

Posted on: 2022-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y. Xiao
Full Text: PDF
GTID: 2505306509997769
Subject: Computer software and theory
Abstract/Summary:
With the acceleration of information globalization, people's ways of living, working, and learning are changing rapidly, which has also driven the networked application of minority languages and scripts. In recent years, Tibetan-language information on the Internet has grown increasingly abundant. How to quickly and accurately find the Tibetan information that meets a user's needs among vast network information resources has become an urgent problem for Tibetan information processing technology. Traditional information retrieval relies largely on keyword matching, which considers only the literal match between words and ignores semantic associations; since Tibetan grammar is diverse and polysemy is common, this leads to a poor retrieval experience for users. Accordingly, this paper introduces a neural network language model into Tibetan information retrieval and extracts the semantic relationship between query terms and documents through BERT pre-training, thereby improving the performance of Tibetan semantic retrieval. The work and main contributions of this paper are as follows.

1. To address the lack of a public Tibetan dataset, this paper uses crawler tools to collect Tibetan news corpora from Tibetan News Network, China Tibetan Network, Tibet Daily, Qinghai Lake Tibetan Network, and other websites as the training data for the pre-trained model, collects data from China Tibetan Netcom as the dataset for Tibetan semantic retrieval, and then builds the pre-trained BERT language model.

2. The BERT pre-trained Tibetan language model is fine-tuned and applied to the Tibetan semantic retrieval task. The semantic information shared by documents and query terms is fully exploited: a linear combination of probability distributions over documents and queries is computed, the similarity between documents and query terms is calculated from this distribution, and the N documents most relevant to the query keywords are returned, so that the documents closest in meaning to the user's query are retrieved.

3. The effectiveness of the proposed method is verified by comparing the pre-trained BERT Tibetan semantic retrieval model against TF-IDF and word-vector baselines on the information retrieval task, further improving the performance of Tibetan semantic retrieval. The results show that the comprehensive evaluation score is 24.31% higher than the traditional keyword-based retrieval method and 19.57% higher than the word-vector-based semantic retrieval method.

4. Based on the above experiments, a simple Tibetan semantic retrieval system was developed. It supports full-text search over the corpus contents, and the result pages are ranked from top to bottom by semantic relevance.
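The ranking step described in contribution 2 can be sketched as follows: score every candidate document against the query by the similarity of their embeddings and return the top N. This is only a minimal illustration, assuming query and document vectors (e.g. pooled BERT sentence embeddings) have already been computed; the toy vectors, document names, and the `rank_documents` helper are invented for illustration and are not the thesis's actual implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_documents(query_vec, doc_vecs, n=3):
    # Score every document against the query embedding and return the
    # top-N (doc_id, similarity) pairs, highest relevance first.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy 3-dimensional vectors standing in for BERT sentence embeddings.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.2],
    "doc_c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.1, 0.0]
top = rank_documents(query, docs, n=2)
```

Here `top` holds the two documents whose embeddings point in nearly the same direction as the query, which is how a semantic retriever can match a relevant document even when it shares no surface keywords with the query.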
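For contrast, the keyword-matching baseline that contribution 3 compares against can be illustrated with a minimal TF-IDF scorer: a document scores high only if the query terms literally occur in it, which is exactly the limitation the thesis attributes to traditional retrieval. The tiny tokenized corpus and the `tfidf_vectors` / `keyword_score` helpers below are hypothetical, chosen only to make the contrast concrete.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: {doc_id: list of tokens}. Returns {doc_id: {term: tf-idf weight}}.
    n = len(docs)
    df = Counter()  # document frequency of each term
    for tokens in docs.values():
        df.update(set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return vectors

def keyword_score(query_tokens, doc_weights):
    # Literal keyword matching: sum the TF-IDF weights of the query
    # terms that actually appear in the document; no semantics involved.
    return sum(doc_weights.get(term, 0.0) for term in query_tokens)

# Invented toy corpus of pre-tokenized documents.
docs = {
    "d1": ["tibetan", "news", "corpus"],
    "d2": ["tibetan", "grammar"],
    "d3": ["weather", "report"],
}
weights = tfidf_vectors(docs)
scores = {d: keyword_score(["tibetan", "corpus"], w) for d, w in weights.items()}
best = max(scores, key=scores.get)
```

A document with no literal overlap (`d3`) scores zero here even if it were semantically related, which is the gap the BERT-based ranking above is meant to close.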
Keywords/Search Tags: neural network language model, Tibetan, semantic retrieval, BERT pre-training