Multi-document Full-text Retrieval System Design And Implementation

Posted on: 2014-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: J J Cai
GTID: 2268330401465403
Subject: Software engineering
Abstract/Summary:
With the rapid development of information technology, the amount of information stored on computers and networks has grown geometrically. Retrieving useful information from massive data quickly and accurately has become an urgent problem, and full-text retrieval technology addresses it well. First, the full text of each document is segmented into words, and a retrieval index is built from the segmentation results. When a user searches across all the texts, the retrieval program looks up the query text in the index and returns the results to the user, which makes retrieval both efficient and accurate.

As an open source tool, Lucene is open and object-oriented, and developers can use its interfaces to build independent retrieval applications. This thesis analyzes the functions and technical principles of Lucene in detail and, on the basis of the Lucene source code, designs and develops a full-text retrieval system that supports a variety of document types. The main work of this thesis is reflected in the following aspects:

(1) The shortcomings of full-text retrieval based on relational databases are analyzed. The thesis studies the methods and basic principles of full-text retrieval based on the file system, and analyzes in detail the structure and index files of the full-text retrieval engine Lucene.

(2) Document parsing technology is used to extend the document types that Lucene can process, overcoming the limitation that Lucene can only process text documents. In the document parsing module, documents in other formats, such as PDF, Word, and Excel, are converted to a type that Lucene supports, so that Lucene can search a variety of commonly used document types.

(3) The full-text retrieval system is designed and implemented. It implements Chinese word segmentation and index creation and maintenance, and sorts the search results according to an improved scoring formula. The system responds to user requests and returns sorted query results, meeting the requirement for full-text retrieval. The platform is built on the Struts framework. Designing and implementing the system with the MVC (Model View Controller) pattern makes its structure clear and the division between modules more explicit, which benefits maintenance and extension.

(4) The system processes all kinds of documents directly, instead of using XML documents as an intermediate carrier, and can build a single index database that covers all of the supported document types. System testing shows that the system implements full-text retrieval over a variety of file types with satisfactory recall and precision, and achieves efficient document retrieval.
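The segment, index, and search pipeline described in the abstract can be illustrated with a minimal sketch. This is not Lucene's API and not the thesis's improved scoring formula: the `InvertedIndex` class, the `tokenize` helper, and the TF-IDF-style score are hypothetical names and choices made only for illustration. The regular-expression tokenizer stands in for a real Chinese word segmenter, and the text passed to `add` is assumed to have already been extracted from the source document (PDF, Word, Excel, and so on).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters.

    Placeholder for a real word segmenter (assumption for this sketch).
    """
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy inverted index: maps each term to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_count = 0

    def add(self, doc_id, text):
        """Segment the extracted text and record term frequencies."""
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1

    def search(self, query):
        """Return doc ids ranked by a simple TF-IDF-style score (illustrative)."""
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rarer terms get a higher weight.
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        # Highest score first; ties broken by doc id for a stable order.
        return sorted(scores, key=lambda d: (-scores[d], d))


index = InvertedIndex()
index.add("a.txt", "Lucene builds an inverted index")
index.add("b.pdf", "full-text retrieval with an index of index files")
index.add("c.doc", "struts framework and MVC")
print(index.search("index"))
```

Because a query only touches the posting lists of its own terms, lookup cost depends on how many documents contain those terms rather than on the total size of the collection, which is the main reason an index-based approach scales better than scanning every document.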
Keywords/Search Tags: multi-source document, Lucene, full-text retrieval, index