Design And Development Of Multi-source Document Full-text Retrieval System Based On Lucene

Posted on:2015-12-09

Degree:Master

Type:Thesis

Country:China

Candidate:W G Du

Full Text:PDF

GTID:2428330488499663

Subject:Software engineering

Abstract/Summary:

With the rapid development of information technology, electronic documents have emerged in large numbers and in a variety of forms. Meeting users' needs quickly and accurately has become a key problem in information retrieval. Full-text retrieval systems derive from information retrieval technology; their core is multi-text retrieval, and their aim is to improve the efficiency of finding the documents a user needs. This thesis designs a multi-document full-text retrieval tool based on Lucene that retrieves text from PDF, Word, Excel and other documents in a specified directory and gives users a dependable channel for information retrieval.

The system organizes, retrieves and queries the title, catalog and content of PDF, Word, Excel and other text-format documents, and returns the documents that match a user's query. Its core technologies include document parsing, word segmentation, index construction, index search and result management. The thesis uses the PDFBox API and POI to extract text content and other basic information from PDF, Word and Excel documents. On this basis, word segmentation is handled either by Lucene's own analysis packages or by ICTCLAS, a third-party segmentation system from the Chinese Academy of Sciences. Preprocessing such as the removal of punctuation and stop words improves segmentation accuracy and reduces index size. After preprocessing, the system organizes the text content and stores it as index files. Finally, the system searches the index files according to the user's retrieval conditions, finds the corresponding documents, sorts the results by relevance or modification time, and returns them to the user.

The system was developed on the Struts2 framework and validated through a series of tests. Functional testing shows that it provides the basic functions of a Lucene-based multi-source document full-text retrieval system and can parse and retrieve documents in a specified directory. Performance testing shows that preprocessing greatly improves system efficiency and shortens indexing and retrieval time. A comparison with an existing text retrieval system shows a significant improvement in precision. Overall, the system proves more specialized and accurate, and meets its design goal of satisfying the needs of individual users and increasing retrieval efficiency.
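The pipeline described in the abstract (preprocess, index, search, sort) can be illustrated with a minimal sketch. This is not the thesis's implementation and does not use Lucene, PDFBox, POI or ICTCLAS: it is a plain-Python toy inverted index that assumes text has already been extracted from the source files. The names `Doc`, `InvertedIndex`, `tokenize` and the tiny `STOP_WORDS` list are hypothetical, and the relevance score is a simple sum of term frequencies rather than Lucene's scoring.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


@dataclass
class Doc:
    doc_id: int
    title: str
    text: str        # assumed to be already extracted from PDF/Word/Excel
    modified: float  # last-modified timestamp (seconds since epoch)


def tokenize(text):
    """Lowercase, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.docs = {}

    def add(self, doc):
        """Index the title and body of one document."""
        self.docs[doc.doc_id] = doc
        for term in tokenize(doc.title + " " + doc.text):
            counts = self.postings[term]
            counts[doc.doc_id] = counts.get(doc.doc_id, 0) + 1

    def search(self, query, sort_by="relevance"):
        """Return matching docs, sorted by summed term frequency or by time."""
        scores = defaultdict(int)
        for term in tokenize(query):
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf
        hits = [self.docs[d] for d in scores]
        if sort_by == "modified":
            hits.sort(key=lambda d: d.modified, reverse=True)
        else:
            hits.sort(key=lambda d: scores[d.doc_id], reverse=True)
        return hits


index = InvertedIndex()
index.add(Doc(1, "Lucene notes", "Lucene builds an inverted index. Lucene is fast.", 100.0))
index.add(Doc(2, "Report", "The index of the report mentions Lucene once.", 200.0))
index.add(Doc(3, "Budget", "Quarterly spreadsheet figures.", 300.0))

print([d.doc_id for d in index.search("the Lucene")])                 # [1, 2]
print([d.doc_id for d in index.search("Lucene", sort_by="modified")])  # [2, 1]
```

Removing stop words before indexing means terms such as "the" never get a postings entry, which is the index-size reduction the abstract attributes to preprocessing. Chinese text would need a real segmenter in place of the regular expression used here.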

Keywords/Search Tags:

Full-text retrieval, Text extraction, Multi-source document, Lucene, Index

Related items

1	Multi-document Full-text Retrieval System Design And Implementation
2	Design And Implementation Of Enterprise Knowledge Document Retrieval Management System
3	The Research And Implementation Of Full-Text System Based On Lucene And Textual Image
4	The Research And Implementation Of Full-text Retrieval System Based On Lucene
5	Multi-document Retrieval System Design And Development
6	Research And Application Of Open-source Full-text Search Engine
7	Lucene Chinese Word Segmentation Applied Research, Research Document Full-text Retrieval System
8	Research And Application Of Full Text Retrieval Technology Based On Lucene
9	Application Study Of Lucene Full-text Retrieval On The Network Education Platform
10	The Research And Implementation Of Full-text Retrieval System Based On Lucene