Study On Efficient Indexing For Large Scale Chinese Text Retrieval Systems

Posted on: 2006-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: J Mi
Full Text: PDF
GTID: 2178360185996965
Subject: Computer application technology
Abstract/Summary:
Inverted files are of great importance for IR systems, yet building one is time-consuming. This thesis focuses on how to speed up indexing.

We identify two factors that slow indexing down. The first is the ineffective use of system resources: the CPU and the I/O devices sit idle while they wait for each other. The second is that document parsing, especially Chinese word segmentation, is one of the slowest steps in building an inverted file; that is, it is the bottleneck.

To address the first problem, we introduce a pipeline into our indexing system. The pipeline increases the parallelism of the system and makes better use of system resources, which shortens indexing time.

For the second problem, we experiment with different lexicon structures and evaluate their performance on word segmentation and parsing. Our results confirm that the double-array trie is an excellent candidate for Chinese word segmentation and that it speeds up indexing significantly.
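
The two ideas above can be illustrated with a small sketch. It is not the thesis's implementation: the stage names (`reader`, `parser`, `indexer`), the queue sizes, and the helper functions are all hypothetical, and the lexicon is a plain nested-dict trie rather than a double-array trie (a double-array trie stores the same transitions in two flat integer arrays, which is more compact and faster to traverse). Segmentation here uses forward maximum matching, which is one common dictionary-based method and only an assumption about what the thesis uses. Note also that CPython threads mainly help overlap I/O waits with computation; CPU-bound stages would need processes to run truly in parallel.

```python
import queue
import threading
from collections import defaultdict

SENTINEL = None   # end-of-stream marker passed between stages
END = "$end"      # trie key marking "a word ends here" (never a single char)


def build_trie(words):
    """Build a nested-dict trie (stand-in for a double-array trie)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def segment(text, trie):
    """Forward maximum matching: take the longest lexicon word at each
    position, falling back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        node, j, longest = trie, i, i + 1
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                longest = j
        tokens.append(text[i:longest])
        i = longest
    return tokens


def reader(docs, out_q):
    """Stage 1: fetch documents (the I/O-bound step in a real system)."""
    for doc_id, text in docs:
        out_q.put((doc_id, text))
    out_q.put(SENTINEL)


def parser(in_q, out_q, trie):
    """Stage 2: segment each document into tokens (the CPU-bound step)."""
    while (item := in_q.get()) is not SENTINEL:
        doc_id, text = item
        out_q.put((doc_id, segment(text, trie)))
    out_q.put(SENTINEL)


def indexer(in_q, index):
    """Stage 3: append (doc_id, token position) postings per term."""
    while (item := in_q.get()) is not SENTINEL:
        doc_id, tokens = item
        for pos, tok in enumerate(tokens):
            index[tok].append((doc_id, pos))


def build_index(docs, lexicon):
    """Run the three stages concurrently, linked by bounded queues."""
    trie = build_trie(lexicon)
    raw_q, tok_q = queue.Queue(maxsize=8), queue.Queue(maxsize=8)
    index = defaultdict(list)
    stages = [
        threading.Thread(target=reader, args=(docs, raw_q)),
        threading.Thread(target=parser, args=(raw_q, tok_q, trie)),
        threading.Thread(target=indexer, args=(tok_q, index)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return dict(index)


if __name__ == "__main__":
    lexicon = ["中文", "信息", "检索", "信息检索", "系统"]
    docs = [(1, "中文信息检索"), (2, "信息检索系统")]
    for term, postings in build_index(docs, lexicon).items():
        print(term, postings)
```

The bounded queues are the design point: while the reader blocks on I/O for the next document, the parser and indexer can keep working on documents already fetched, so neither the CPU nor the disk waits for a whole batch to finish.
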
Keywords/Search Tags:inverted files, indexing, pipeline, word segmentation, double-array trie