Research On Some Key Technologies Of Enterprise Search Engine

Posted on: 2016-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Bai
GTID: 2298330467492565
Subject: Information security
Abstract/Summary:
With the rapid development of the Internet, businesses, schools and communities store more and more data on the network. Because of its long crawl cycles, slow resource updates, high cost, and broad index, a traditional general-purpose search engine cannot be applied to an enterprise network, which is domain-specific, frequently updated, and cost-constrained, and which requires high-precision retrieval. In this thesis, to suit the characteristics of the enterprise network, the author takes the update time of web pages into account and combines a weighting algorithm based on word frequency and word location with the PageRank algorithm to design a scoring and ranking algorithm for the enterprise network. The author also improves existing word segmentation methods, and then designs and implements a full-text search engine for enterprise networks.

The main contributions of this thesis are as follows:

1. The author surveys the technologies related to search engines and the Hadoop distributed platform. The survey focuses on the components of a search engine and the related key technologies, covering the inverted index, Chinese word segmentation, scoring and ranking techniques, and the ROBOTS protocol. Technologies related to the Hadoop distributed platform are also analyzed, mainly MapReduce and HDFS.

2. The author designs a full-text search scoring algorithm for enterprise networks. The key factors affecting full-text retrieval accuracy are the scoring algorithm and the Chinese word segmentation algorithm. The author combines the word frequency and location weighting algorithm with the PageRank algorithm, and introduces a time factor, to design the scoring algorithm. To improve the efficiency of the scoring algorithm, the thesis adopts IK Analyzer and improves its segmentation with a disambiguation algorithm based on two-word coupling, which reduces the number of ambiguous words.

3. The author designs and implements a distributed search engine for enterprise networks. The engine is built on the Hadoop distributed platform and uses the full-text search scoring algorithm designed in this thesis.

4. The author tests the distributed search engine implemented in this thesis. At the end of the thesis, the author deploys a three-node distributed search engine and then tests it.
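
The abstract does not give the scoring formula, so the following is only a minimal sketch of how the ingredients named in contribution 2 (word frequency, word location, PageRank, and a time factor) might be combined. Everything specific here is an assumption made for illustration and is not taken from the thesis: the field weights, the log-dampened frequency, the linear mix controlled by `alpha` and `beta`, the exponential freshness decay with a 30-day half-life, and all function and class names.

```python
import math
from dataclasses import dataclass

# Hypothetical location weights: a hit in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


@dataclass
class Document:
    fields: dict      # field name -> list of already-segmented tokens
    pagerank: float   # precomputed link score, assumed normalized to [0, 1]
    age_days: float   # days since the page was last updated


def term_location_score(doc, query_terms):
    """Sum location-weighted, log-dampened term frequencies over all fields."""
    score = 0.0
    for field, tokens in doc.fields.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for term in query_terms:
            tf = tokens.count(term)
            if tf:
                score += weight * (1.0 + math.log(tf))
    return score


def freshness(age_days, half_life_days=30.0):
    """Exponential decay: 1.0 for a page updated now, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def score(doc, query_terms, alpha=0.7, beta=0.3):
    """Mix text relevance with PageRank, then scale by freshness (assumed form)."""
    text = term_location_score(doc, query_terms)
    if text == 0.0:
        return 0.0  # a document that matches no query term is not ranked
    return (alpha * text + beta * doc.pagerank) * freshness(doc.age_days)


if __name__ == "__main__":
    fresh = Document({"title": ["search", "engine"], "body": ["hadoop"]}, 0.5, 0.0)
    stale = Document({"title": ["search", "engine"], "body": ["hadoop"]}, 0.5, 60.0)
    for name, doc in (("fresh", fresh), ("stale", stale)):
        print(name, score(doc, ["search"]))
```

In this sketch a recently updated page outranks an otherwise identical older one, and a match in the title outranks a match in the body, which is the behavior the abstract asks for on a frequently updated enterprise network. The actual weights and combination used in the thesis may differ.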
Keywords/Search Tags: search engine, Hadoop, scoring algorithm, Chinese word segmentation