The Research On Full-Text Search And Related Technologies

Posted on:2013-01-10

Degree:Master

Type:Thesis

Country:China

Candidate:Y M Lin

Full Text:PDF

GTID:2218330371453063

Subject:Computer application technology

Abstract/Summary:

With the development of the computer industry, more and more electronic information is stored on computer storage devices. This information can be divided into structured and unstructured data. According to statistics, unstructured data accounts for over 80% of all information. Full-text search technology lets people manage unstructured data efficiently. Full-text search has evolved from simple string-matching programs into large-scale integrated software for managing large volumes of text, voice, images, video, and other unstructured data. Full-text retrieval systems have now become synonymous with a new generation of information management systems.

We studied full-text search and related technologies in three main areas: Chinese word segmentation, indexing, and retrieval.

We compared several typical existing dictionary mechanisms and methods for fast Chinese word segmentation and studied techniques for improving them. Based on the double-word hash dictionary, we propose an improved method that sorts the words colliding in the second hash table by word frequency, which speeds up segmentation. We also studied automaton-based Chinese word segmentation and propose implementing Chinese word segmentation in hardware. We designed experiments to evaluate the performance of the typical dictionary mechanisms, their improved structures, and our methods on Chinese word segmentation. By integrating English word recognition, case conversion, stop-word filtering, stemming, and other functions, we implemented a tokenizer that can be used alone or in conjunction with other tools.

We studied traditional inverted index methods and some of their improvements and examined the characteristics of the Lucene index structure. On this basis, we designed a domain-oriented incremental inverted index format. The proposed method changes field numbers from relative to absolute, so a query across multiple index segments needs only a single conversion from field name to field number, which saves query time. Compression is also used to reduce the size of the index file. Based on this index file format, we built an indexer and used it in experiments to test incremental indexing performance and the compression effect.

We studied the retrieval process, document ranking, relevance feedback, and the related technologies and algorithms. We implemented a searcher that supports term queries, Boolean queries, and free-text queries ranked with the similarity calculation of the vector space model. We used the indexer and the searcher in a variety of experiments to verify the correctness of indexing and retrieval.

The tokenizer, indexer, and searcher together constitute a basic retrieval system framework. With this framework, a variety of retrieval systems can be implemented conveniently.
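To make the tokenizer, indexer, and searcher roles concrete, the following is a minimal Python sketch of such a pipeline. It is not the thesis's implementation: the names `tokenize`, `InvertedIndex`, and `STOP_WORDS`, the in-memory postings layout, and the smoothed IDF formula are all assumptions made for illustration. It handles English text only and omits Chinese segmentation, stemming, compression, the on-disk index format, and Boolean queries.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop-word list (an assumption, not the thesis's list).
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


class InvertedIndex:
    """Hypothetical in-memory index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_ids = set()

    def add(self, doc_id, text):
        """Add one new document without rebuilding the existing postings."""
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf
        self.doc_ids.add(doc_id)

    def idf(self, term):
        """Smoothed inverse document frequency; 0.0 for unseen terms."""
        df = len(self.postings.get(term, {}))
        return math.log(1 + len(self.doc_ids) / df) if df else 0.0

    def search(self, query):
        """Return (doc_id, cosine score) pairs, best match first."""
        q_vec = {t: tf * self.idf(t) for t, tf in Counter(tokenize(query)).items()}
        q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
        if q_norm == 0.0:
            return []
        # Document vector lengths, recomputed per query to keep the sketch short.
        d_norm = defaultdict(float)
        for term, docs in self.postings.items():
            w = self.idf(term)
            for doc_id, tf in docs.items():
                d_norm[doc_id] += (tf * w) ** 2
        # Dot products only touch postings of the query terms.
        dots = defaultdict(float)
        for term, qw in q_vec.items():
            for doc_id, tf in self.postings.get(term, {}).items():
                dots[doc_id] += qw * tf * self.idf(term)
        scored = [(d, dot / (math.sqrt(d_norm[d]) * q_norm)) for d, dot in dots.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


index = InvertedIndex()
index.add(1, "Full-text search with an inverted index")
index.add(2, "Chinese word segmentation with a hash dictionary")
index.add(3, "Inverted index compression")
results = index.search("inverted index")
print(results)
```

In this example, documents 1 and 3 both contain the two query terms once, so the shorter document 3 gets the higher cosine score, and document 2 is not returned. A production indexer would precompute document vector lengths at indexing time instead of recomputing them for every query.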

Keywords/Search Tags:

full-text search, Chinese word segmentation, inverted index, retrieval system

Related items

1	Design And Implementation Of Chinese Retrieval System For Vertical Domain
2	Research On Full-Text Retrieval Technology For The Single Chinese Character
3	The Research And Design Of Chinese Full Text Information Retrieval Systems Based On PSO
4	Design And Improvement Of Website Full-text Retrieval System Based On Lucene
5	Research And Implementation Of An Open High-Performance Platform Of Full-Text Retrieval
6	Research Of Index In Chinese Full-text Retrieval System
7	The Research Of Full-Text Retrieval And Its Relative Security Technology For Chinese
8	A Research Of Full-Text Retrieval Based On Inverted Index
9	Military Retrieval System Design And Implementation
10	A Research On Chinese Word Segmentation Based On The Combination Of Dictionary And Statistics And Full-Text Retrieval System Design