
Research Of Search Engine Key Technique And Optimize Performance

Posted on: 2009-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhao
GTID: 2178360272456443
Subject: Computer application technology
Abstract/Summary:
Full-text indexing and retrieval is an efficient way to find specific information in a large, complex database. Lucene, a member project of the open source organization Apache Jakarta, is a toolkit that can easily be used to add indexing and retrieval to application systems.

The core and extended libraries in Lucene segment Chinese text automatically, in the same way that they segment English words. Because the two languages differ, however, the results are coarse and the efficiency is poor. Based on a detailed study of how the Lucene core performs Chinese word segmentation for full-text retrieval, this paper presents a Chinese word segmentation module that is based on a word library (lexicon) and uses the forward maximum matching algorithm. The module is then implemented and tested. Experimental results show that it is more effective and efficient than both the single-character segmentation used in the Lucene core library and the bigram segmentation used in the Lucene extended library for CJK (Chinese, Japanese and Korean) languages.

In practice, information retrieval involves a great many synonyms. It is difficult to enumerate every expression of the same concept, so relevant documents may be missing from the results. This paper implements a solution for synonym retrieval. Lucene uses an inverted index file, which improves efficiency and saves storage space. The index file records each term together with its document ID, position offset, and frequency of occurrence. By intervening while the index is being written, we insert the synonyms at the appropriate relative position in the TokenStream and then reset their position offsets so that they occupy the same position as the primary term, which makes synonym retrieval possible.
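
The synonym mechanism described above can be illustrated with a minimal Python sketch. It is not the thesis implementation and does not use the Lucene API: the `Token` type, the `SYNONYMS` table, and the functions `inject_synonyms` and `build_index` are all hypothetical names chosen for illustration. The only idea taken from the abstract is that each synonym is written into the token stream at the same position as its primary term (in Lucene terms, this corresponds to a position increment of zero).

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    text: str
    position: int  # position of the term within the token stream


# Hypothetical synonym table: primary term -> list of synonyms.
SYNONYMS: Dict[str, List[str]] = {
    "car": ["automobile", "vehicle"],
}


def inject_synonyms(terms: List[str], table: Dict[str, List[str]]) -> List[Token]:
    """Emit each term, followed by its synonyms at the SAME position."""
    stream: List[Token] = []
    for position, term in enumerate(terms):
        stream.append(Token(term, position))
        for synonym in table.get(term, []):
            # The synonym shares the primary term's position.
            stream.append(Token(synonym, position))
    return stream


def build_index(stream: List[Token]) -> Dict[str, List[int]]:
    """Toy inverted index for a single document: term -> positions."""
    index: Dict[str, List[int]] = {}
    for token in stream:
        index.setdefault(token.text, []).append(token.position)
    return index


tokens = inject_synonyms(["red", "car", "engine"], SYNONYMS)
index = build_index(tokens)
```

Because "automobile" and "vehicle" are indexed at the position of "car", a query for either synonym finds the document, and the positions of the neighbouring terms are unchanged, so phrase-style matching around the primary term would presumably still work.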
This paper also designs a synonym storage structure within the word table; in practice it has proved efficient to access and easy to maintain.

On the application side, the main work of this paper is the design and implementation of a patent information search system. Building on related work such as document data processing and file format conversion, the system implements the word analyzer, the indexer, the searcher, and the database storage design. The finished system supports functions such as Chinese and English patent retrieval, browsing of patent abstracts, and browsing and downloading of full-text descriptions.
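
As a rough illustration of the lexicon-based forward maximum matching segmentation summarized in the abstract, the following Python sketch scans the text from left to right and, at each position, takes the longest lexicon entry that matches. The function name `forward_max_match`, the `max_len` parameter, and the tiny `LEXICON` are assumptions made for this example; the abstract does not describe the actual module's data structures or word library.

```python
from typing import List, Set


def forward_max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Segment text greedily, preferring the longest lexicon word at each position."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical miniature lexicon for demonstration only.
LEXICON = {"专利", "信息", "检索", "系统", "检索系统"}
result = forward_max_match("专利信息检索系统", LEXICON)
```

With this lexicon the longer entry "检索系统" wins over "检索" followed by "系统", which is the behaviour that distinguishes maximum matching from single-character or bigram splitting.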
Keywords/Search Tags: Full-text Retrieval, Chinese Word Segmentation, Synonym Retrieval, Lucene, Patent Retrieval