
Design And Development Of Multi-source Document Full-text Retrieval System Based On Lucene

Posted on: 2015-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: W G Du
Full Text: PDF
GTID: 2428330488499663
Subject: Software engineering
Abstract/Summary:
With the rapid development of information technology, electronic documents have emerged in large numbers and in a variety of forms. Meeting users' needs quickly and accurately has become a key problem in information retrieval. Full-text retrieval systems derive from information retrieval technology; their core is multi-text retrieval, and their aim is to improve the efficiency of finding the documents a user needs. This thesis designs a multi-document full-text retrieval tool based on Lucene that retrieves text from PDF, Word, Excel and other documents in a specified directory and gives users a reliable channel for information retrieval.

The system organizes, retrieves and queries the title, table of contents and body of PDF, Word, Excel and other text-format documents, and returns the documents that match a user's query. Its core techniques are document parsing, word segmentation, index construction, index searching and result management. For document parsing, the system uses the PDFBox API and POI to extract the text content and other basic information from PDF, Word and Excel documents. It then performs word segmentation with either Lucene's own analysis package or ICTCLAS, a third-party word segmentation system from the Chinese Academy of Sciences. Combining segmentation with pre-processing, such as removing punctuation and stop words, improves segmentation accuracy and reduces the index size. After pre-processing, the system organizes the text content and stores it as index files. Finally, the system searches the index according to the user's retrieval conditions, finds the corresponding documents, sorts the results by relevance or by last-modified time, and returns them to the user. As a whole, the multi-source document full-text retrieval system proves to be more specialized and accurate, and it achieves its design aim of satisfying the needs of individual users and increasing retrieval efficiency.

The proposed system was developed on the Struts2 framework and validated through a series of tests. The functional tests show that it fulfills the basic functions of a Lucene-based multi-source document full-text retrieval system and can parse and retrieve documents in a specified directory. The performance tests show that the pre-processing step greatly improves system efficiency and shortens indexing and retrieval time. A comparison with an existing text retrieval system shows a significant improvement in precision.
Keywords/Search Tags:Full-text retrieval, Text extraction, Multi-source document, Lucene, Index
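
The pre-processing, indexing and sorting steps described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation and does not use Lucene's API: the class `MiniIndex`, its methods, and the stop-word list are hypothetical names chosen for illustration, relevance is approximated by raw term frequency, and tokens are split on punctuation and whitespace, which suits English text only (Chinese text would need a segmenter such as ICTCLAS).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the index and search steps; this is NOT Lucene's API.
final class MiniIndex {
    // Hypothetical stop-word list; a real system would load a fuller one.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "of", "the", "to");

    // One indexed document: name, last-modified time (epoch millis), term counts.
    private static final class Doc {
        final String name;
        final long modified;
        final Map<String, Integer> termCounts = new HashMap<>();

        Doc(String name, long modified) {
            this.name = name;
            this.modified = modified;
        }
    }

    private final List<Doc> docs = new ArrayList<>();
    // Inverted index: term -> ids of the documents that contain it.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Pre-processing: lowercase, split on punctuation/whitespace, drop stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Indexes already-extracted text (extraction from PDF/Word/Excel is out of scope here).
    void add(String name, long modified, String text) {
        int id = docs.size();
        Doc doc = new Doc(name, modified);
        for (String term : tokenize(text)) {
            // Record the posting only the first time the term appears in this document.
            if (doc.termCounts.merge(term, 1, Integer::sum) == 1) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
            }
        }
        docs.add(doc);
    }

    // Returns names of documents containing every query term, ordered either by
    // summed term frequency (a crude relevance score) or by newest modification time.
    List<String> search(String query, boolean byModifiedTime) {
        List<String> terms = tokenize(query);
        List<String> names = new ArrayList<>();
        if (terms.isEmpty()) {
            return names;
        }
        List<Doc> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(terms.get(0), List.of())) {
            Doc doc = docs.get(id);
            if (doc.termCounts.keySet().containsAll(terms)) {
                hits.add(doc);
            }
        }
        Comparator<Doc> order = byModifiedTime
                ? Comparator.comparingLong((Doc d) -> d.modified).reversed()
                : Comparator.comparingInt((Doc d) -> score(d, terms)).reversed();
        hits.sort(order);
        for (Doc doc : hits) {
            names.add(doc.name);
        }
        return names;
    }

    private static int score(Doc doc, List<String> terms) {
        int total = 0;
        for (String term : terms) {
            total += doc.termCounts.getOrDefault(term, 0);
        }
        return total;
    }
}
```

In this sketch, text extracted from each file would be passed to `add` together with the file name and modification time, and `search` would be called with the user's query and the chosen sort order. Dropping stop words before indexing is what keeps the postings map smaller, which mirrors the index-size reduction the abstract attributes to pre-processing.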