
Implementation And Optimization Of A Large-scale Enterprise Search Engine

Posted on: 2017-06-15  Degree: Master  Type: Thesis
Country: China  Candidate: D R Lai  Full Text: PDF
GTID: 2348330536453388  Subject: Engineering
Abstract/Summary:
Enterprises that need to search large collections of documents typically rely on a full-text search engine, and the most popular solution at present is the open-source distributed search engine ElasticSearch. In this setting, however, a whole document differs considerably from a suitable search unit, which departs from the general case. Furthermore, each document is much larger than the web pages that general-purpose search engines usually index. As a result, general search engines perform poorly on large-scale document information retrieval.

This paper presents an optimization of a search engine that hosts large-scale document data. Its core is the open-source distributed search engine ElasticSearch, with its configuration tuned so that these documents can be retrieved on demand. The optimization covers the indexing process, the index storage strategy, the splitting of document data into smaller units, and the online display programs. The main process converts the documents to HTML files and splits the HTML files into smaller units. This makes the index units more reasonable and can solve related problems in the search process.

To improve the users' search experience, this paper also develops some core functions, including keyword extraction and summary extraction. The keyword extraction module aims to improve search accuracy: during a search, a document receives a higher relevance score when the query matches one of its keywords. The extracted keywords also help users understand the document data more easily. The module implements several keyword extraction models (new-word detection, and extraction strategies based on statistical information, on graphs, and on cluster analysis). Using the machine learning method Learning to Rank, the program selects the best keywords from the results of the several extraction strategies, and the F-Measure can exceed 50%.

The summary extraction module is intended to improve the user search experience by giving users a basis for quickly understanding the content of a document. The module implements several extraction models (extraction based on statistical information, graph-based extraction, and extraction based on cluster analysis). Using the machine learning method Learning to Rank, the program selects the best summary from the results of the several extraction models, and here too the F-Measure can exceed 50%.

This research provides a viable solution for enterprise search engine optimization in similar scenarios. Although open-source search engines now work out of the box, the tuning of auxiliary search functions and the optimization of search results still lack a more complete implementation. Working in a real scenario with large-scale data, this paper aims to build a high-performance, user-friendly search engine, to implement the program completely, and to record the results of comparative tests.
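The document-splitting step mentioned in the abstract (converting documents to HTML and splitting the HTML into smaller index units) might look roughly like the sketch below. This is only an illustration, not the thesis's implementation: the abstract does not say how units are delimited, so splitting at heading tags is an assumption, and the names `SectionSplitter` and `split_html` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a new index unit starts at every heading tag.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionSplitter(HTMLParser):
    """Collect text into units, starting a new unit at each heading."""

    def __init__(self):
        super().__init__()
        self.units = []          # finished units: {"title": str, "body": str}
        self._title_parts = []   # text seen inside the current heading
        self._body_parts = []    # text seen after the current heading
        self._in_heading = False

    def _flush(self):
        """Store the unit collected so far, if it has any text."""
        title = " ".join(self._title_parts).strip()
        body = " ".join(self._body_parts).strip()
        if title or body:
            self.units.append({"title": title, "body": body})
        self._title_parts = []
        self._body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()            # close the previous unit
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        target = self._title_parts if self._in_heading else self._body_parts
        target.append(text)

    def close(self):
        super().close()
        self._flush()                # store the final unit


def split_html(html):
    """Return the list of units found in an HTML string."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return parser.units
```

Each returned unit could then be indexed as its own search document, so that a query matches a section rather than an entire large file. Text before the first heading is kept as a unit with an empty title.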
Keywords/Search Tags: Search Engine, ElasticSearch, Document Converter and Split, Learning to Rank, Keyword Extraction, Summarization