Study On Efficient Indexing For Large Scale Chinese Text Retrieval Systems

Posted on:2006-02-09

Degree:Master

Type:Thesis

Country:China

Candidate:J Mi

Full Text:PDF

GTID:2178360185996965

Subject:Computer application technology

Abstract/Summary:

Inverted files are essential to IR systems, yet building one is a time-consuming process. This thesis focuses on how to speed up indexing.

We identify two factors that slow indexing down. The first is the ineffective use of system resources: the CPU and the I/O devices stall while they wait for each other. The second is that document parsing, especially Chinese word segmentation, is one of the slowest steps in building an inverted file, which makes it the bottleneck.

To address the first problem, we introduce pipelining into our indexing system. Pipelining increases the parallelism of the system and makes better use of system resources, which shortens indexing time.

To address the second problem, we experiment with different lexicon structures and evaluate their performance on word segmentation and parsing. Our results confirm that the double-array trie is an excellent candidate for Chinese word segmentation and that it also speeds up indexing significantly.
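As an illustration only, the pipelining idea can be sketched as three stages connected by bounded queues, so that reading, segmentation, and index construction overlap instead of waiting for each other. This is a minimal sketch under stated assumptions, not the thesis's implementation: the names `read_stage`, `parse_stage`, `index_stage`, `segment`, and `build_index` are hypothetical, the segmenter uses plain forward maximum matching, and the lexicon is an ordinary Python `set` standing in for the double-array trie evaluated in the thesis.

```python
import queue
import threading
from collections import defaultdict

SENTINEL = None  # end-of-stream marker passed from stage to stage


def segment(text, lexicon, max_len):
    """Forward maximum matching: take the longest lexicon word at each
    position, falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def read_stage(docs, out_q):
    """Stage 1: hand documents to the pipeline (disk reads in a real system)."""
    for doc_id, text in docs:
        out_q.put((doc_id, text))
    out_q.put(SENTINEL)


def parse_stage(in_q, out_q, lexicon, max_len):
    """Stage 2: segment each document into words."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            return
        doc_id, text = item
        out_q.put((doc_id, segment(text, lexicon, max_len)))


def index_stage(in_q, index):
    """Stage 3: append each document ID to the posting list of its words."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            return
        doc_id, words = item
        for word in set(words):
            index[word].append(doc_id)


def build_index(docs, lexicon):
    """Run the three stages concurrently and return word -> [doc_id, ...]."""
    max_len = max(map(len, lexicon), default=1)
    q1 = queue.Queue(maxsize=8)  # bounded so a fast stage cannot run far ahead
    q2 = queue.Queue(maxsize=8)
    index = defaultdict(list)
    threads = [
        threading.Thread(target=read_stage, args=(docs, q1)),
        threading.Thread(target=parse_stage, args=(q1, q2, lexicon, max_len)),
        threading.Thread(target=index_stage, args=(q2, index)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(index)
```

For example, `build_index([(1, "中文分词"), (2, "分词索引")], {"中文", "分词", "索引"})` yields posting lists keyed by the segmented words. In CPython, threads mainly let I/O overlap with computation; spreading the CPU-bound segmentation stage across cores would need processes or a native implementation.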

Keywords/Search Tags:

inverted files, indexing, pipeline, word segmentation, double-array trie

Related items

1	Research On Chinese Word Segmentation For Large Scale Information Retrieval
2	Research On Efficient Index Structure And Parallelization Based On Double Array Trie
3	Research And Improvement Of ICTCLAS Chinese Lexical Analysis System
4	Research And Application Of The Key Techniques In Chinese Query Answering System Of Networking Education
5	Dictionary Based Chinese Word Segmentation Algorithm And Its Application In Nutch System
6	Research On Cloud Data Security Deduplication Technology Based On Double Array Trie Tree
7	Research Of Chinese Word Segmentation With Conditional Random Fields And Implementation
8	The Design And Implementation Of Site Search Engine Based On The Inverted Index And The Trie
9	Research Of Personalized Search Based On Trie Tree
10	Study On The Theory & Practice Of Automatic Indexing Of WWW Science And Technology Information Resources