Research On Indexing Strategies For Chinese Information Retrieval

Posted on: 2007-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Han
Full Text: PDF
GTID: 2178360185489426
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of the Internet, the amount of available information keeps increasing. As a core Internet technology, Information Retrieval (IR) has great commercial value: it determines what information is presented to the user, and it is a key step in information processing. Research into the key technologies of IR therefore has a significant impact on the economy, society, and the military.

This thesis focuses on indexing strategies for Chinese IR. Indexing strategy is a particular problem for Chinese IR because Chinese text has no spaces separating words. The indexing strategies compared are Chinese character indexing, word indexing, and n-gram indexing based on Chinese characters. The research topics include:

1. Chinese automatic word segmentation. Segmentation is an indispensable step in word-index-based Chinese IR. First, the ambiguity in Chinese segmentation is analyzed; second, the language model used for disambiguation is introduced; then the smoothing algorithms used to improve the language model's performance are presented. By integrating these techniques, our Chinese segmentation system achieves very high accuracy and meets the needs of Chinese IR.

2. The implementation of IR, i.e. the data organization in an IR system. An IR system needs efficient access to the document collection, so careful data organization is necessary. The thesis first explores the indexing methods used in IR, i.e. the forward index and the inverted index; second, two keyword lookup structures (B-tree and hash table) are presented; then the compression of inverted indexes and text is introduced; finally, suitable techniques are adopted to organize the data so that the experiments can be run efficiently.
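
As a rough illustration of character indexing, character n-gram indexing, and the inverted index mentioned in the abstract, the following minimal Python sketch may help. It is not the thesis's implementation: the function names `char_ngrams` and `build_inverted_index`, the sample documents, and the choice to ignore whitespace are all assumptions made for this example. Word indexing is omitted because it would need a segmenter.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return overlapping character n-grams of text, ignoring whitespace.

    n=1 corresponds to character indexing, n=2 to bigram indexing.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_inverted_index(docs, n):
    """Map each n-gram to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in char_ngrams(text, n):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical three-document collection, used only for illustration.
docs = {1: "信息检索技术", 2: "中文信息处理", 3: "检索系统"}

unigram_index = build_inverted_index(docs, 1)  # character indexing
bigram_index = build_inverted_index(docs, 2)   # character bigram indexing

print(bigram_index["信息"])
print(bigram_index["检索"])
```

Neither strategy in this sketch needs a segmentation step, which is the main reason character-based indexes are a natural baseline for word indexing. A forward index would store the reverse mapping, from each document id to its list of terms.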
Keywords/Search Tags: Chinese Information Retrieval, Indexing Strategies, Probability Model, 2-Poisson Model