
Design And Implementation Of Distributed Search Engine Based On Hadoop Cloud Platform

Posted on: 2017-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: M X Wu
Full Text: PDF
GTID: 2348330512959052
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, the explosive growth of web page data has put traditional network storage products to a severe test, and the new concept of cloud storage has emerged in response. Cloud storage is built on cloud computing, which can be regarded as an extension of distributed computing, parallel computing and grid computing: a huge computing task on the network is split into many smaller subroutines, distributed to a system composed of a group of servers, and the results of the computation and analysis are returned to the user.

At the same time, traditional search technology has become inadequate. The latest cloud storage technology has brought innovation to the traditional search industry, and search over traditional personal cloud drives (SkyDrive-style products) will gradually be replaced by search based on data banks. Current cloud storage products provide file storage, file synchronization and other functions, but they also have shortcomings, such as limited capacity, limits on the size of transmitted files, restrictions on transmission formats, incomplete monitoring of file operations, low file-synchronization efficiency, and immature cloud storage platforms. Under the explosive growth of data, traditional search engines show various problems and can hardly meet users' search needs.

Based on an analysis of existing distributed search engine technology and a summary of the advantages and disadvantages of existing systems, this thesis implements a distributed search engine system on the Hadoop cloud platform using the MapReduce programming framework, which can provide good service for distributed retrieval libraries, portal websites, forums and individual users. HDFS, the jpathwatch class library and the Rsync differential data synchronization algorithm are used to realize shared file synchronization on the Hadoop cloud platform, so as to meet users' needs. The research work of this thesis covers the following aspects.

First, the system uses the MapReduce programming framework to implement a distributed indexing subsystem and a distributed query subsystem with good computational performance, reliability and scalability. The MapReduce framework in Hadoop is an implementation based on the MapReduce model published by Google. Users do not need to consider the complexities of distributed storage, job scheduling, load balancing, fault tolerance or network communication; they only need to write a Map function and a corresponding Reduce function to handle a distributed task.

Second, this thesis presents a search scheme based on adaptive switching of the index size. Testing shows that the scheme achieves good search efficiency across different index sizes. A user-preference-based search is also designed, which gives users a more flexible way to search and makes it easier to obtain more accurate results. In addition, the TF-IDF algorithm is improved and the web page scoring strategy is refined.

Third, the system designs a real-time monitoring protocol based on an event queue and a synchronization protocol based on data-block differences. The file system is monitored in real time with the open-source Java library jpathwatch, which overcomes the static binding of events in traditional systems and enables dynamic monitoring of multiple event types. Only the differing portions of files are synchronized using the Rsync algorithm, which overcomes the full-copy synchronization of traditional file systems and keeps data transmission as small as possible.
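To illustrate the "write only a Map function and a Reduce function" point, the following is a minimal inverted-index sketch on the Hadoop MapReduce API. It is not the thesis's indexing subsystem; the tab-separated "docId, page text" input format and the class names are assumptions made for illustration.

// Minimal inverted-index sketch on Hadoop MapReduce (illustrative only).
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    // Map: emit (term, documentId) for every term in the input line.
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed input format: "docId<TAB>page text".
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) {
                return;
            }
            String docId = parts[0];
            for (String term : parts[1].toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    context.write(new Text(term), new Text(docId));
                }
            }
        }
    }

    // Reduce: collect the posting list (set of document ids) for each term.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            Set<String> postings = new HashSet<>();
            for (Text id : docIds) {
                postings.add(id.toString());
            }
            context.write(term, new Text(String.join(",", postings)));
        }
    }
}

The framework handles job scheduling, data distribution and fault tolerance; the job driver only needs to wire these two classes into a Job configuration.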
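The abstract states that the TF-IDF algorithm and the web page scoring strategy are improved, but does not describe the improvement. As a reference point only, the baseline TF-IDF weight that such a scoring strategy builds on can be sketched as follows; the class and parameter names are hypothetical.

// Baseline TF-IDF weight; the thesis's improved scoring would adjust this.
public final class TfIdf {

    // termFreq: occurrences of the term in the page
    // docLength: total number of terms in the page
    // totalDocs: number of pages in the collection
    // docsWithTerm: number of pages containing the term
    public static double weight(int termFreq, int docLength, long totalDocs, long docsWithTerm) {
        if (termFreq == 0 || docLength == 0 || docsWithTerm == 0) {
            return 0.0;
        }
        double tf = (double) termFreq / docLength;                 // normalized term frequency
        double idf = Math.log((double) totalDocs / docsWithTerm);  // inverse document frequency
        return tf * idf;
    }

    private TfIdf() {
    }
}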
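For the real-time monitoring protocol, a minimal sketch is given below using the standard java.nio.file.WatchService, whose API jpathwatch mirrors; it stands in for the thesis's jpathwatch-based event queue rather than reproducing it. The shared-directory path and the rsync-style synchronization hook are hypothetical placeholders.

// Real-time directory monitoring sketch (java.nio.file.WatchService,
// which the jpathwatch library mirrors).
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class SharedFolderMonitor {

    public static void main(String[] args) throws IOException, InterruptedException {
        Path sharedDir = Paths.get(args.length > 0 ? args[0] : "/srv/shared");
        WatchService watcher = FileSystems.getDefault().newWatchService();

        // Register several event kinds at once, so multiple events are
        // monitored dynamically rather than bound statically one by one.
        sharedDir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);

        while (true) {
            WatchKey key = watcher.take();             // blocks until events arrive
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = sharedDir.resolve((Path) event.context());
                System.out.println(event.kind() + ": " + changed);
                // Hypothetical hook: push only the changed blocks to HDFS,
                // e.g. via an rsync-style rolling-checksum difference.
            }
            if (!key.reset()) {
                break;                                 // directory no longer accessible
            }
        }
    }
}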
Keywords/Search Tags:Distributed compute, Search Engine, Map-Reduce, Hadoop