Research And Application Of Lucene Full-text Retrieval Technology In Patent Information Service Platform

Posted on: 2011-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: X R Chen
Full Text: PDF
GTID: 2178360305976543
Subject: Computer application technology
Abstract/Summary:
This thesis arises from the requirements of a biomedical patent information service platform. Building on an in-depth study of the Lucene full-text retrieval toolkit and related technologies, we extend Lucene's Chinese word segmentation module and improve its default sorting algorithm. We also add a module for handling multiple document formats, so that the patent system designed in this thesis supports patent documents in different formats. Finally, we apply this work to the patent information service platform, which improves the performance of the patent retrieval system.

The main contributions of this thesis are as follows:

i. We study the Lucene toolkit in depth and survey the processing technologies for several commonly used document formats. We focus on Chinese word segmentation and Lucene's sorting mechanism, to provide a theoretical basis for applying these technologies to the patent information service platform.

ii. We extend the Chinese word segmentation module in Lucene and propose an automatic Chinese word segmentation technique based on rules and a suffix array, designed around the features of patents and the difficulties of automatic Chinese word segmentation. Experiments show that this technique greatly increases the precision and recall of Chinese word segmentation in patents.

iii. We propose a method for calculating the weights of characteristic words by modifying the traditional TF-IDF formula according to the features of patents, and apply user-defined sorting to the retrieved results. Experiments show that this sorting method returns better-matched documents.

iv. So that the patent system is neither limited to plain-text search nor dependent on conversion to an intermediate document format, we design a common interface that, with the help of third-party parsing tools, processes patent documents in different formats (such as PDF, Word and HTML) and converts them into a form Lucene can process. In this way the system supports retrieval over all common patent document formats.

v. Finally, we apply Lucene full-text retrieval technology to the patent information service platform. Experiments and practical use show that the platform performs better in ranking retrieved patents, search precision, search scope and response time, which greatly improves the performance of the patent retrieval system.
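
The weighting idea in contribution iii can be illustrated with a minimal sketch. The abstract does not state how the TF-IDF formula was modified, so everything specific here is an assumption made for illustration: the field names (title, abstract, claims), the weights in FIELD_WEIGHTS, and the function names are all hypothetical, and the code does not use Lucene. It shows one plausible way a patent-aware variant could differ from plain TF-IDF, namely by counting a term more heavily when it appears in a more important field.

```python
import math
from collections import Counter

# Hypothetical per-field weights; the thesis abstract does not give its values.
FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "claims": 1.0}


def weighted_tf(doc, term):
    """Length-normalised term frequency, summed over fields and scaled
    by each field's assumed weight. `doc` maps field name -> token list."""
    total = 0.0
    for field, tokens in doc.items():
        if tokens:
            weight = FIELD_WEIGHTS.get(field, 1.0)
            total += weight * Counter(tokens)[term] / len(tokens)
    return total


def idf(docs, term):
    """Smoothed inverse document frequency: log(N / (1 + df)) + 1."""
    df = sum(1 for doc in docs if any(term in tokens for tokens in doc.values()))
    return math.log(len(docs) / (1 + df)) + 1.0


def score(docs, doc, query_terms):
    """Sum of field-weighted TF times IDF over the query terms."""
    return sum(weighted_tf(doc, t) * idf(docs, t) for t in query_terms)


def rank(docs, query_terms):
    """Return document indices ordered from best to worst match."""
    return sorted(
        range(len(docs)),
        key=lambda i: score(docs, docs[i], query_terms),
        reverse=True,
    )


# Toy corpus: the same term scores higher in a title than in an abstract.
docs = [
    {"title": ["lucene", "retrieval"], "abstract": ["patent", "search", "system"]},
    {"title": ["patent", "system"], "abstract": ["lucene", "index", "tool"]},
    {"title": ["drug", "compound"], "abstract": ["protein", "assay", "method"]},
]
print(rank(docs, ["lucene"]))
```

In this toy corpus the first document ranks above the second for the query "lucene" only because of the assumed title weight; with equal field weights the ordering would depend on field length alone.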
Keywords/Search Tags: Patent Retrieval, Lucene, Full-text Retrieval, Chinese Word Segmentation, Sorting Order