Research on Mongolian-Chinese Cross-language Information Retrieval Model

Posted on: 2019-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: L J Ma
Full Text: PDF
GTID: 2438330551960571
Subject: Computer Science and Technology
Abstract/Summary:
As Internet information has globalized, the volume of web content has grown explosively. Different countries and ethnic groups use different scripts, so this content is written in many languages. Beyond searching in their native language, more and more users want to retrieve information written in other languages. Cross-language information retrieval meets this need and has become an important research direction. The growth of the Internet in China has also advanced information processing for ethnic minority languages: more and more minority-language websites have appeared, and information retrieval for these languages has developed considerably. With cross-language information retrieval, users can retrieve minority-language information using a language they are familiar with.

Query translation and document translation are the two common approaches to cross-language information retrieval. Both unify the source language and the target language, turning the cross-language retrieval task into a monolingual retrieval task. Dictionary-based translation can be used to translate source-language words, and it is often combined with query expansion to improve retrieval performance. Training a machine translation model usually requires a high-quality parallel corpus to reach a usable level of translation precision, but such corpora are scarce, and they are even harder to obtain for minority languages, which are low-resource languages. Document translation also incurs storage overhead for the translated documents. For these reasons, this thesis focuses on Mongolian-Chinese cross-language information retrieval based on cross-language word vectors.

The main contributions of this thesis are as follows:

1) Cross-language word vectors are used to map Chinese query words to Mongolian, and the methods proposed in this thesis filter and rank the expanded query words so that the mapped query is refined to the proper Mongolian word(s). The mapping is performed before the query is issued. During this process, the Cross_valid method proposed in this thesis uses the context words to find the proper Mongolian word(s). Compared with machine translation, mapping Chinese query words with cross-language word vectors does not require a high-quality parallel corpus, which is a significant advantage given how scarce such corpora are. It also takes much less space. Compared with document-translation-based methods, it is able to handle out-of-vocabulary words.

2) A Mongolian-Chinese cross-language information retrieval system is designed and implemented, which retrieves Mongolian information for Chinese queries. A working Mongolian web crawler was developed with available technologies. The crawled Mongolian content goes through a series of preprocessing steps and is stored in a database, from which the index is built. Cross-language word vectors are then used to map the source-language query, yielding a working Chinese-Mongolian cross-language information retrieval system.
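
The abstract does not give the details of the mapping step or of Cross_valid, so the following is only a minimal sketch of the general idea, under stated assumptions: Chinese and Mongolian word vectors already live in one shared space, candidates are found by cosine similarity, and each candidate is re-scored by how well it agrees with the other words in the query. The function names, the `alpha` weight, the toy vectors and the placeholder tokens (`zh_bank`, `mn_river_bank`, and so on) are all hypothetical and are not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_targets(src_vec, tgt_vectors, k=2):
    """Return the k target-language words closest to src_vec, with their similarity."""
    scored = sorted(
        ((cosine(src_vec, vec), word) for word, vec in tgt_vectors.items()),
        reverse=True,
    )
    return [(word, sim) for sim, word in scored[:k]]


def map_query(query_words, src_vectors, tgt_vectors, k=2, alpha=0.3):
    """Map each source query word to one target word.

    Assumption (not from the thesis): a candidate's score blends its direct
    similarity to the query word with its average similarity to the OTHER
    query words, which act as context. Words without a vector are skipped.
    """
    known = [w for w in query_words if w in src_vectors]
    mapped = {}
    for word in known:
        context = [src_vectors[c] for c in known if c != word]
        best_word, best_score = None, float("-inf")
        for cand, sim in nearest_targets(src_vectors[word], tgt_vectors, k):
            if context:
                ctx = sum(cosine(tgt_vectors[cand], c) for c in context) / len(context)
                score = (1 - alpha) * sim + alpha * ctx
            else:
                score = sim  # single-word query: nothing to cross-check against
            if score > best_score:
                best_word, best_score = cand, score
        mapped[word] = best_word
    return mapped


# Toy 2-d vectors in a shared space; tokens are placeholders, not real vocabulary.
SRC = {"zh_bank": [1.0, 1.0], "zh_money": [1.0, 0.0]}
TGT = {
    "mn_river_bank": [0.9, 1.1],
    "mn_finance_bank": [1.2, 0.8],
    "mn_money": [1.0, 0.1],
}

if __name__ == "__main__":
    print(map_query(["zh_bank"], SRC, TGT))
    print(map_query(["zh_bank", "zh_money"], SRC, TGT))
```

In this toy data the ambiguous word `zh_bank` is mapped to `mn_river_bank` when it is queried alone, and to `mn_finance_bank` once `zh_money` is present as context, which illustrates why context words can help select the proper target word. No parallel corpus is consulted at query time; only the two vector tables are needed.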
Keywords/Search Tags:Mongolian-Chinese cross-language information retrieval, Cross-language word vectors, Mongolian, Query expansion