The Research On Full-Text Search And Related Technologies

Posted on: 2013-01-10  Degree: Master  Type: Thesis
Country: China  Candidate: Y M Lin  Full Text: PDF
GTID: 2218330371453063  Subject: Computer application technology
Abstract/Summary:
With the development of the computer industry, more and more electronic information is stored on computer storage devices. This information can be divided into structured and unstructured data. According to statistics, unstructured data accounts for over 80% of all information. Full-text search technology lets people manage unstructured data efficiently. Full-text search has evolved from the initial string-matching programs into large-scale integrated management software for large volumes of text, voice, images, moving images and other unstructured data. Full-text retrieval systems have now become synonymous with a new generation of information management systems.

We studied full-text search and related technologies in three main areas: Chinese word segmentation, indexing and retrieval.

We compared several typical existing dictionary mechanisms and methods for fast Chinese word segmentation, and studied some improvements to those mechanisms. Based on the double-word hash dictionary, we propose an improved method that sorts the words that collide in the second hash table by word frequency, which speeds up segmentation. We also studied automata-based Chinese word segmentation and propose implementing Chinese word segmentation in hardware. We designed experiments to measure the segmentation performance of the typical dictionary mechanisms, their improved structures and our methods. By integrating English word recognition, case conversion, stop-word filtering, stemming and other functions, we built a tokenizer that can be used alone or in conjunction with other tools.

We studied traditional inverted index methods and some of their improvements and, drawing on the characteristics of the Lucene index structure, designed a domain-oriented incremental inverted index format. We propose changing field numbers from relative to absolute. As a result, a query across multiple index segments needs to convert the field name to a field number only once, which saves query time. Compression is also used to reduce the size of the index file. Based on this index file format, we implemented an indexer and used it in experiments that test incremental indexing performance and the compression effect.

We studied the retrieval process, document ranking, relevance feedback and other techniques and algorithms involved in search. We implemented a searcher that supports term queries, Boolean queries and free-text queries, the last of which are ranked with the similarity calculation of the vector space model. We used the indexer and the searcher in a variety of experiments to verify the correctness of indexing and retrieval.

The tokenizer, indexer and searcher together constitute a basic retrieval system framework, on which a variety of retrieval systems can conveniently be built.
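
As a rough illustration of how a tokenizer, an indexer and a searcher can fit together, the Python sketch below builds a small in-memory inverted index and answers a Boolean AND query and a ranked free-text query. It is not the thesis's implementation. Whitespace splitting stands in for Chinese word segmentation, the index is a plain dictionary rather than the incremental, compressed format described above, tf-idf weighting with cosine similarity is an assumed form of the vector space model, and every name (`tokenize`, `build_index`, `boolean_and`, `search`, `STOP_WORDS`) is hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop list; a real tokenizer would use a fuller one.
STOP_WORDS = {"the", "a", "of", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words.
    (No Chinese segmentation or stemming in this sketch.)"""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_index(docs):
    """Indexer: return an inverted index, term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def boolean_and(index, terms):
    """Boolean query: ids of the documents that contain every term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def search(index, n_docs, query):
    """Free-text query: rank documents by cosine similarity of tf-idf vectors."""
    idf = {t: math.log(n_docs / len(p)) for t, p in index.items()}

    # Squared length of every document vector.
    norms = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * idf[term]) ** 2

    # Dot product of the query vector with each matching document vector.
    q_tf = Counter(tokenize(query))
    scores = defaultdict(float)
    for term, qf in q_tf.items():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += (qf * idf[term]) * (tf * idf[term])

    q_norm = math.sqrt(sum((qf * idf.get(t, 0.0)) ** 2 for t, qf in q_tf.items()))
    ranked = []
    for doc_id, dot in scores.items():
        denom = math.sqrt(norms[doc_id]) * q_norm
        if denom > 0:  # skip vectors made only of zero-weight terms
            ranked.append((doc_id, dot / denom))
    return sorted(ranked, key=lambda pair: -pair[1])
```

A caller would build the index once with `build_index(docs)` and then pass it, together with `len(docs)`, to `search`. Keeping the three stages as separate functions mirrors the idea that each component can be used alone or combined with other tools.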
Keywords/Search Tags:full-text search, Chinese word segmentation, inverted index, retrieval system