
The Research Of Distributed Search Engine Technology Based On Pagerank Algorithm

Posted on: 2014-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: X X Guo
Full Text: PDF
GTID: 2268330425983228
Subject: Computer application technology
Abstract/Summary:
Since human society entered the era of electronic information, the Internet industry has developed rapidly, and network information resources have gradually become an important way for us to access information. Search engines face unprecedented challenges at the present stage, and technological innovation is imperative. Given the variety of information available, whether a search engine can accurately retrieve the important information that meets users' needs has become one of the key indicators of its quality. Web page ranking algorithms, as a factor that influences search engine quality, have therefore received extensive attention. At the same time, in the face of the rapid development of the Internet and the explosive growth of data, the traditional centralized search engine is gradually reaching its limits. A distributed search engine can address the limitations of the centralized search engine in scalability, network information coverage, and real-time performance. Instead of a centralized structure that implements all system functions on a single host, a number of servers on the Internet jointly realize the functions of the search engine under the control of a central node, forming a distributed search structure. Search engine operators are paying increasing attention to research on distributed search engines, which have become the development trend for the next generation of search engines.
This article first studies PageRank, the classical page ranking algorithm based on the link structure of the web, and proposes an improved PageRank algorithm that addresses two of its defects: the equal assignment of authority and the neglect of random user behavior. The new algorithm makes use of the transition probabilities of a Markov chain. The transition probability is constructed from the ratio of a web page's in-degree among its competitors and the users' dual retrieval probability, so that authority is assigned accordingly.
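The idea of replacing equal authority assignment with in-degree-based transition probabilities can be sketched as follows. This is only an illustration under stated assumptions, not the formula from the thesis: the abstract does not define the "dual retrieval probability", so it is left out, and the function name `indegree_weighted_pagerank`, the `links` input format, and the damping value 0.85 are all hypothetical choices. Here the "competitors" of a page are taken to be the other pages linked from the same source page.

```python
def indegree_weighted_pagerank(links, damping=0.85, iterations=50):
    """Rank pages with in-degree-weighted transition probabilities.

    links: dict mapping each page to the list of pages it links to
           (hypothetical input format, not taken from the thesis).
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # In-degree of each page: the number of links pointing to it.
    indegree = {p: 0 for p in pages}
    for targets in links.values():
        for t in targets:
            indegree[t] += 1

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random-jump term, shared equally by all pages.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for source in pages:
            targets = links.get(source, [])
            if not targets:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[source] / n
                for p in pages:
                    new_rank[p] += share
                continue
            # Transition probability source -> t is the in-degree of t
            # divided by the total in-degree of the competing targets,
            # instead of the uniform 1 / len(targets) of classic PageRank.
            total = sum(indegree[t] for t in targets)
            for t in targets:
                new_rank[t] += damping * rank[source] * indegree[t] / total
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(indegree_weighted_pagerank(graph).items()):
        print(page, round(score, 4))
```

Because each row of transition probabilities sums to 1, the ranks remain a probability distribution over pages, which matches the Markov chain view mentioned in the abstract.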
At the same time, this article designs a distributed search engine model based on the Hadoop and Lucene open source frameworks, introducing the Hadoop Distributed File System (HDFS) and the Map/Reduce computation model into the traditional search engine. The system is divided into distributed crawler, distributed indexer, and distributed searcher modules to realize the distributed design of the search engine. Using a Master/Slave structure, a master node distributes tasks to the slave nodes, which carry out the functions. The master node coordinates the slave nodes by analyzing the "heartbeat" records they report. The improved distributed search engine model places low performance requirements on each PC and offers better extensibility, better real-time performance, and higher network coverage. In addition, this article integrates the improved PageRank algorithm into the distributed system to optimize retrieval quality, combining the improved PageRank algorithm with the distributed search engine.
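The Master/Slave coordination described above can be sketched in a few lines. This is a minimal single-process illustration, not the thesis's implementation and not Hadoop's API: the class name `Master`, its methods, the 10-second timeout, and the least-loaded assignment policy are all assumptions made for the example. Timestamps are passed in explicitly so the behavior is deterministic.

```python
class Master:
    """Toy master node that tracks slave heartbeats and assigns tasks."""

    def __init__(self, timeout=10.0):
        self.timeout = timeout   # seconds without a heartbeat before a slave counts as dead
        self.last_seen = {}      # slave id -> time of its latest heartbeat
        self.assignments = {}    # slave id -> list of tasks it currently holds

    def heartbeat(self, slave_id, now):
        # Record the heartbeat reported by a slave node.
        self.last_seen[slave_id] = now
        self.assignments.setdefault(slave_id, [])

    def live_slaves(self, now):
        # A slave is live if its latest heartbeat is within the timeout.
        return sorted(s for s, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def assign(self, task, now):
        # Give the task to the live slave that holds the fewest tasks.
        live = self.live_slaves(now)
        if not live:
            raise RuntimeError("no live slave nodes")
        target = min(live, key=lambda s: len(self.assignments[s]))
        self.assignments[target].append(task)
        return target

    def reassign_dead(self, now):
        # Move tasks from slaves that missed their heartbeat to live slaves.
        live = set(self.live_slaves(now))
        moved = []
        if not live:
            return moved  # nowhere to move tasks; leave them in place
        for slave in list(self.assignments):
            if slave not in live and self.assignments[slave]:
                orphaned, self.assignments[slave] = self.assignments[slave], []
                for task in orphaned:
                    moved.append((task, self.assign(task, now)))
        return moved


if __name__ == "__main__":
    master = Master(timeout=10.0)
    master.heartbeat("slave-1", now=0.0)
    master.heartbeat("slave-2", now=0.0)
    print(master.assign("crawl-0", now=1.0))
    print(master.assign("crawl-1", now=1.0))
    master.heartbeat("slave-2", now=8.0)   # slave-1 stops reporting
    print(master.reassign_dead(now=15.0))
```

In a real deployment the heartbeats would arrive over the network and the tasks would be crawler, indexer, or searcher jobs. The sketch only shows how heartbeat records let the master decide which nodes can still receive work.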
Keywords/Search Tags:PageRank Algorithm, Distributed Search Engine, HDFS, Map/Reduce