Building a Laboratory Data Acquisition System Based on the Lucene Search Engine

Posted on: 2014-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Liu
GTID: 2268330398995400
Subject: Computer technology
Abstract/Summary:
As teaching and research information systems have been established and developed, they have produced a great deal of professional information. This includes not only structured information stored in databases, such as research, experiment, and student records, but also a large amount of unstructured information, such as teaching and research materials and experimental data. Some of this unstructured information may be stored in databases, but much of it resides on file servers or in content management systems.

This thesis studies the retrieval of large volumes of unstructured data, such as laboratory data and teaching and research information, stored on file servers or in content management systems. It builds a laboratory data search engine, MonsterSearch, on top of Lucene. MonsterSearch consists of a Parse module and a Search module. The Parse module uses the Tika parsing framework to extract text content and associated metadata from unstructured data, uses the Lucene search framework to create the index, and stores the indexed data in a database. The Search module calls the Lucene search framework to perform retrieval, so that users can search the laboratory's various information resources.

The main work of this thesis is as follows.

First, the thesis analyzes the Lucene search framework in depth: its structure, retrieval mechanism, data flow, index structure, text analysis, and scoring mechanism. It then describes the call sequence and processing logic inside Lucene, and presents the data structure of the Lucene index and its index optimization strategies. A mathematical derivation of Lucene's core scoring formula gives a deeper understanding of the scoring mechanism and provides a basis for building the search engine system.

Second, the thesis analyzes and explains in detail the Apache Tika parser interface, which extracts text content and metadata from unstructured data, and then explains how Tika determines a document's type and extracts its text. An in-depth analysis of Tika's language identification mechanism shows that the lack of Chinese support can be addressed by constructing an N-gram analyzer.

Third, the MonsterSearch search engine system is implemented according to the requirements analysis. The system integrates the IKAnalyzer tokenizer to segment search terms accurately and avoid the problems caused by poor Chinese word support, and it applies index optimization strategies. It uses multithreading to parse unstructured information and create indexes, making full use of CPU resources to improve indexing speed. Based on an analysis of the search engine's characteristics and its use of system resources, Berkeley DB is used to store the index information. The system provides several retrieval methods for unstructured data such as teaching and research data and experimental data, and it highlights search results to improve the user experience. Memory management and index backup issues that arose during operation and maintenance were also resolved.

Finally, the search engine was deployed and run on an HP ProLiant DL380 G7 server. Functional testing and search quality evaluation show that the system meets the design requirements.
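The indexing and scoring ideas summarized in the abstract can be illustrated with a small, self-contained sketch. It is not the MonsterSearch implementation and uses neither Lucene, Tika, nor IKAnalyzer. All function names (`char_ngrams`, `build_index`, `search`) and the sample documents are hypothetical. Character bigrams stand in for Chinese word segmentation purely for illustration; the abstract mentions N-grams in connection with Tika language identification and names IKAnalyzer for segmentation. The term weights assume that the scoring formula studied in the thesis is Lucene's classic TF-IDF similarity, with tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)); the coordination factor, query normalization, length norms, and boosts are omitted.

```python
import math
from collections import defaultdict


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_index(docs, n=2):
    """Map each n-gram to {doc_id: frequency} (a toy inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram][doc_id] = index[gram].get(doc_id, 0) + 1
    return index


def search(index, num_docs, query, n=2):
    """Rank documents by a simplified TF-IDF sum over the query's n-grams."""
    scores = defaultdict(float)
    for gram in char_ngrams(query, n):
        postings = index.get(gram, {})
        if not postings:
            continue
        # Assumed classic Lucene-style weights: tf = sqrt(freq), idf squared.
        idf = 1.0 + math.log(num_docs / (len(postings) + 1))
        for doc_id, freq in postings.items():
            scores[doc_id] += math.sqrt(freq) * idf * idf
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical sample documents standing in for extracted text content.
docs = {
    "lab1": "实验室数据检索",
    "lab2": "教学科研信息",
    "lab3": "实验数据与科研数据",
}
index = build_index(docs)
results = search(index, len(docs), "实验数据")
print(results)
```

In this sketch a document that contains more of the query's bigrams, or rarer ones, ranks higher, which is the intuition behind TF-IDF scoring. A production system would rely on the analyzer and similarity classes of the search library rather than on hand-written code like this.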
Keywords/Search Tags: Lucene, Tika, unstructured data, index, search engine