Research Of Mongolian Information Retrieval Method Based On The LDA And System Implementation

Posted on: 2017-12-30  Degree: Master  Type: Thesis
Country: China  Candidate: R G L Si  Full Text: PDF
GTID: 2348330485471361  Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, people can easily access global network information anytime and anywhere. This explosion of information brings great convenience. It also promotes the development of network applications for minority languages and plays an active role in the development of Mongolian search engines. Mongolian is one of the important minority languages in our country. Finding the Mongolian information that people need quickly and accurately in a large body of Mongolian resources is a major challenge in Mongolian information retrieval.
Traditional Mongolian information retrieval answers a query by keyword matching alone: it considers only the literal match between words and does not take full advantage of semantic information. In fact, because of the diversity of the Mongolian language, the probability that the same object is described by the same keywords is less than 20%. Synonymy and polysemy are also relatively common: several words can share one meaning, and one word can carry several meanings. This widens the gap between the set of documents retrieved and the information the user's query actually calls for, so the most relevant information may not be retrieved, which reduces retrieval effectiveness.
To address this problem, this paper looks for a solution in semantic information mining. The LDA topic model can extract the latent co-occurrence information in documents and the semantic relationships between documents. LDA not only inherits the advantages of traditional retrieval models but also provides a platform for developing semantic retrieval, creating the conditions for deeper information retrieval and improved accuracy.
This paper proposes a new Mongolian information retrieval method that combines the LDA topic model with a language model.
First, the method builds a unigram or bigram model of the Mongolian text to obtain the language-model probability distribution of the text. Next, it builds an LDA topic model and estimates the model parameters by Gibbs sampling to uncover the latent topic probability distribution of each document. Finally, it linearly combines the document's topic distribution with its language-model distribution, computes the similarity between the combined distribution and the query keywords, and returns the documents most relevant to the query. In this method, the language model makes full use of Mongolian grammatical features, while the LDA model has good generalization ability. Combining the two achieves better topic-based semantic retrieval of Mongolian documents and improves retrieval accuracy.
Experiments used a corpus of primary-school language teaching materials, encoded in the international coding standard, as the test dataset. The results show that, compared with the traditional keyword-based model and with retrieval that uses LDA topic information alone, this method improves both the accuracy and the recall of information retrieval, which validates the effectiveness and practicality of the method.
Furthermore, this paper designs and implements an information retrieval system for a corpus of Mongolian language teaching materials, oriented toward educational applications. The system is built on a Java Web framework and supports full-text search of the corpus content as well as database retrieval by fields such as title, version number, press, and education stage. The search results page is laid out according to traditional Mongolian convention, arranged from left to right, and relevant content can be highlighted.
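The three steps of the method (language model, LDA with Gibbs sampling, linear combination) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than a detail from the thesis: the function names, the hyperparameters (alpha, beta, lam), the additive smoothing, the query log-likelihood used as the similarity score, and the toy corpus, in which English placeholder tokens stand in for segmented Mongolian words. Only the unigram case is shown.

```python
import math
import random
from collections import Counter


def unigram_lm(doc, vocab, alpha=0.1):
    """Additively smoothed unigram distribution P_lm(w | d)."""
    counts = Counter(doc)
    total = len(doc) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def lda_gibbs(docs, vocab, n_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, phi)."""
    rng = random.Random(seed)
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]       # document-topic counts
    nkw = [[0] * V for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                        # tokens assigned to each topic
    z = []                                     # topic assignment of every token
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            ndk[d][k] += 1
            nkw[k][idx[w]] += 1
            nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = z[d][i], idx[w]
                # Remove the token, resample its topic, then add it back.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1
    theta = [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(nkw[t][v] + beta) / (nk[t] + V * beta) for v in range(V)]
        for t in range(n_topics)
    ]
    return theta, phi


def score(query, d, docs, vocab, theta, phi, lam=0.7):
    """Query log-likelihood under lam * P_lm(w|d) + (1 - lam) * P_lda(w|d)."""
    idx = {w: i for i, w in enumerate(vocab)}
    p_lm = unigram_lm(docs[d], vocab)
    total = 0.0
    for w in query:
        if w not in idx:
            continue  # out-of-vocabulary query terms are skipped in this sketch
        p_lda = sum(theta[d][t] * phi[t][idx[w]] for t in range(len(phi)))
        total += math.log(lam * p_lm[w] + (1 - lam) * p_lda)
    return total


def rank(query, docs, vocab, theta, phi, lam=0.7):
    """Document indices sorted from most to least relevant."""
    scored = [(score(query, d, docs, vocab, theta, phi, lam), d) for d in range(len(docs))]
    return [d for _, d in sorted(scored, reverse=True)]


# Toy corpus: placeholder tokens, not real Mongolian data.
docs = [
    ["horse", "grassland", "herder", "horse"],
    ["school", "teacher", "book", "school"],
    ["grassland", "sheep", "herder"],
]
vocab = sorted({w for doc in docs for w in doc})
theta, phi = lda_gibbs(docs, vocab)
ranking = rank(["horse"], docs, vocab, theta, phi)
print(ranking)
```

The weight lam controls how much the score relies on literal word matches versus topic-level evidence; lam = 1 reduces to the plain language model and lam = 0 to LDA alone, the two kinds of baseline the abstract compares against.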
Keywords/Search Tags:Mongolian, topic model LDA, LM, Gibbs, information retrieval