
Study on Tibetan Information Retrieval & Search Results Clustering and System Implementation

Posted on: 2014-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: D W Wan
Full Text: PDF
GTID: 2248330398974728
Subject: Computer application technology
Abstract/Summary:
Tibetan is the carrier of Tibetan culture and civilization. It has a long history and is used by more than 6 million people in China. There is a large body of Tibetan works covering a wide range of subjects. Since the Windows operating system began supporting Tibetan script, more and more Tibetan people have started surfing the Internet. However, no Tibetan search engine is available at present, so the topic is worth exploring. This thesis focuses on Tibetan word segmentation, Tibetan web page collection, Tibetan encoding conversion, and search results clustering, and it aims to implement a working Tibetan information retrieval system. The major contributions of this thesis are as follows:

First, a Tibetan segmentation algorithm is proposed for the system. Tibetan text has no separators between words, so words must be segmented by machine for Tibetan information retrieval (IR). Current segmentation algorithms are mainly based on statistical probability, part-of-speech tagging, case-auxiliary words, continuous features, etc. However, because Tibetan segmentation corpora are lacking, some of these algorithms cannot be applied to Tibetan, and the others are too complex to implement for Tibetan. The proposed algorithm uses dictionary matching, rules, and the features of Tibetan, and it achieves good results.

Second, Tibetan document clustering is explored. The vector space model is used to represent Tibetan documents for clustering, and Tibetan stop words are obtained from statistics over a large number of Tibetan documents. Partitioning clustering and hierarchical clustering are then applied to the Tibetan documents.

Third, Tibetan IR is studied and a system is implemented. The work covers Tibetan web page collection, Tibetan encoding, Tibetan document processing, document storage, etc., which together enable a computer to process Tibetan information. Based on Lucene, I then implemented a full-featured information retrieval system that can be used as a Tibetan search engine. Combining the previous results, I also implemented search results clustering.
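
The abstract does not spell out the segmentation algorithm beyond saying that it relies on dictionary matching together with rules and features of Tibetan. As a rough illustration of the dictionary-matching part only, the sketch below assumes forward maximum matching over syllables, which in Tibetan script are delimited by the tsheg mark (U+0F0B). The function names `split_syllables` and `segment`, the `max_len` parameter, and the toy dictionary are hypothetical and not taken from the thesis.

```python
# Minimal sketch of dictionary-based segmentation, ASSUMING forward maximum
# matching over syllables. This is an illustration, not the thesis's algorithm.

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark that separates syllables


def split_syllables(text):
    """Split text into syllables on the tsheg mark, dropping empty pieces."""
    return [s for s in text.split(TSHEG) if s]


def segment(text, dictionary, max_len=4):
    """Greedily match the longest dictionary word starting at each syllable.

    dictionary: set of words, each written as syllables joined by TSHEG.
    max_len: assumed upper bound on word length in syllables (hypothetical).
    A syllable that starts no dictionary word is emitted as a one-syllable word.
    """
    syllables = split_syllables(text)
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for n in range(longest, 0, -1):  # try the longest candidate first
            candidate = TSHEG.join(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


if __name__ == "__main__":
    # Wylie-style transliterated syllables are used as readable placeholders.
    toy_dict = {TSHEG.join(["bod", "yig"]), TSHEG.join(["slob", "grwa"])}
    sample = TSHEG.join(["bod", "yig", "slob", "grwa", "chen"]) + TSHEG
    print(segment(sample, toy_dict))
```

In a real system the rule-based and feature-based steps mentioned in the abstract (for example, handling case-auxiliary words) would sit on top of this matching pass; they are left out here because the abstract gives no detail about them.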
Keywords/Search Tags: Tibetan segmentation, Tibetan clustering, Tibetan information retrieval, cluster-based retrieval