Research And Implementation Of Distributed Search Engine Based On Hadoop

Posted on: 2018-09-23 | Degree: Master | Type: Thesis
Country: China | Candidate: Y L Zhang | Full Text: PDF
GTID: 2348330512483270 | Subject: Communication and Information System
Abstract/Summary:
Today the Internet has entered an era of universal participation. People use it in increasingly diverse ways, and publishing information has become easier, so the network is filled with massive amounts of data. How to index and retrieve this data is the focus of current search engine research. A centralized index can no longer meet the requirements of today's big data environment, so combining distributed technology with indexing technology to realize distributed indexing and retrieval gives the index strategy important research value. Search engines also collect a wide variety of data, and a single user query retrieves a large number of result pages, so sorting search results by importance is meaningful research work. From this point of view, this thesis studies the distributed indexing strategy and the link ranking algorithm. The main work is as follows:

1. The theories related to distributed search engines are studied, focusing on distributed indexing strategies, including the local and global index strategies. The thesis then proposes a hybrid indexing strategy based on DHT and MapReduce. The implementation principle and process of MapReduce are also explained.
2. The link analysis and sorting algorithm HITS is analyzed and studied, and the HVHITS algorithm is proposed, which uses the link associate cite degree and the link-text associate similarity degree. By combining the ideas of Trust-Score and ACO, a feedback improvement strategy is also proposed.
3. A distributed search engine system based on Hadoop is designed and implemented. In the index and retrieval module, the hybrid index strategy based on MapReduce and DHT is implemented in parallel. In the link analysis and sorting module, the FHVHITS algorithm is parallelized with MapReduce.
4. Finally, test themes and methods are selected, and the performance of the distributed search engine system and of the improved HITS algorithm is tested and evaluated.
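For readers unfamiliar with the baseline that the thesis improves on, the following is a minimal single-machine sketch of the standard HITS iteration (hub and authority scores updated in turn and L2-normalized). It is not the HVHITS or FHVHITS algorithm, whose weighting terms are not specified in this abstract, and it is not parallelized with MapReduce. The function name `hits`, the dictionary-based graph format, and the fixed iteration count are assumptions made for illustration.

```python
import math


def hits(graph, iterations=50):
    """Standard HITS on a link graph (illustrative sketch, not the thesis's variant).

    graph: dict mapping each page to the list of pages it links to.
    Returns (hub, authority) dicts of L2-normalized scores.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that link to each page.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub update: sum the authority scores of the pages each page links to.
        new_hub = {p: sum(auth[d] for d in graph.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth


# Hypothetical three-page graph: pages "a" and "b" both link to "c".
example_graph = {"a": ["c"], "b": ["c"], "c": []}
hub_scores, auth_scores = hits(example_graph)
```

In this toy graph, "c" ends up as the only page with a non-zero authority score, while "a" and "b" share equal hub scores. In a MapReduce setting, each of the two update steps would typically map over link pairs and reduce by target (or source) page, but how the thesis structures those jobs is not described here.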
Keywords/Search Tags: Distributed Search Engine, HITS, Hybrid Index, Hadoop, Feedback