
Investigation On Web Crawler Technology Based On Hadoop Platform

Posted on: 2018-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
Full Text: PDF
GTID: 2348330536979634
Subject: Computer system architecture

Abstract/Summary:
The rapid development of the Internet has brought explosive growth of online content, and this flood of information poses great challenges to information retrieval. Faced with such a huge volume of information and with users' personalized retrieval needs, improving the efficiency and accuracy of web information search has become a key problem that urgently needs to be solved, and web crawler technology is an important part of the solution. Because a single computer cannot accomplish a task of this scale, this thesis adopts the Hadoop cloud platform for distributed computing and storage, running improved web crawler techniques on Hadoop to crawl information efficiently and accurately.

On the basis of in-depth study of the Hadoop cloud platform and web crawler technology, the shortcomings of existing topic crawling algorithms are identified and addressed: feature extraction is optimized, the relevance calculation is improved using a semantic tree, and a weight-optimized topic crawling algorithm is proposed; the algorithm is implemented as MapReduce jobs on the cloud platform to improve the efficiency and accuracy of topic crawling. To optimize Bloom-filter-based link deduplication, a hierarchical Bloom filter tree is constructed according to link attributes, enabling fast and accurate deduplication; running it on the cloud platform improves performance as well as time and space efficiency, yielding a more effective and more accurate link deduplication algorithm.

Based on the working principles of a Hadoop-based web crawler system, the system is constructed, and its web page download module, web page parsing module, and link processing module are designed and implemented in detail, with the improved algorithms applied in the key functional modules. Experiments on the constructed system validate the improved algorithms; the results show that they are feasible and effective in improving performance and efficiency.
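The abstract does not give the semantic-tree relevance formula or the weight-optimization details, so the sketch below shows only the conventional cosine-similarity relevance score that topic crawlers commonly build on; the class name, method names, and the idea of a fetch threshold are illustrative assumptions, not the thesis's method.

    import java.util.HashMap;
    import java.util.Map;

    // Baseline topic-relevance score: cosine similarity between a page's
    // term-weight vector and a topic vector. The thesis refines relevance
    // with a semantic tree; that formula is not given in the abstract, so
    // this shows only the conventional starting point such crawlers improve.
    public class TopicRelevance {

        public static double cosine(Map<String, Double> page, Map<String, Double> topic) {
            double dot = 0, pNorm = 0, tNorm = 0;
            for (Map.Entry<String, Double> e : page.entrySet()) {
                pNorm += e.getValue() * e.getValue();
                Double w = topic.get(e.getKey());
                if (w != null) dot += e.getValue() * w;   // shared terms only
            }
            for (double w : topic.values()) tNorm += w * w;
            if (pNorm == 0 || tNorm == 0) return 0;
            return dot / (Math.sqrt(pNorm) * Math.sqrt(tNorm));
        }

        public static void main(String[] args) {
            Map<String, Double> topic = new HashMap<>();
            topic.put("hadoop", 1.0);
            topic.put("crawler", 0.8);

            Map<String, Double> page = new HashMap<>();
            page.put("hadoop", 0.5);
            page.put("mapreduce", 0.4);

            // A topic crawler would fetch pages (and expand their links)
            // only when the score clears some relevance threshold.
            System.out.printf("relevance = %.3f%n", cosine(page, topic));
        }
    }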
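Likewise, the exact structure of the hierarchical Bloom filter tree and the attribute used to partition it are not specified in the abstract. The sketch below assumes the URL's host as the partitioning attribute and keeps one small Bloom filter per host; it conveys the partitioning idea behind attribute-based deduplication without claiming to reproduce the thesis's design.

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of attribute-partitioned Bloom filters for URL
    // deduplication. Each host gets its own small filter, so a lookup only
    // probes the filter for that host instead of one large shared bit array.
    public class HostBloomIndex {

        static class BloomFilter {
            private final BitSet bits;
            private final int size;
            private final int hashes;

            BloomFilter(int size, int hashes) {
                this.bits = new BitSet(size);
                this.size = size;
                this.hashes = hashes;
            }

            // Derive k probe positions from two base hashes
            // (Kirsch-Mitzenmacher double hashing).
            private int probe(String key, int i) {
                int h1 = key.hashCode();
                int h2 = h1 >>> 16 | 1;          // force an odd second hash
                return Math.floorMod(h1 + i * h2, size);
            }

            void add(String key) {
                for (int i = 0; i < hashes; i++) bits.set(probe(key, i));
            }

            boolean mightContain(String key) {
                for (int i = 0; i < hashes; i++)
                    if (!bits.get(probe(key, i))) return false;
                return true;                      // "possibly seen"; false positives allowed
            }
        }

        private final Map<String, BloomFilter> perHost = new HashMap<>();

        private static String hostOf(String url) {
            // Crude host extraction for the sketch; real code would use java.net.URI.
            String s = url.replaceFirst("^[a-z]+://", "");
            int slash = s.indexOf('/');
            return slash < 0 ? s : s.substring(0, slash);
        }

        /** Returns true if the URL is new, and records it as seen. */
        public boolean addIfNew(String url) {
            BloomFilter f = perHost.computeIfAbsent(
                    hostOf(url), h -> new BloomFilter(1 << 20, 5));
            if (f.mightContain(url)) return false; // probably a duplicate
            f.add(url);
            return true;
        }

        public static void main(String[] args) {
            HostBloomIndex index = new HostBloomIndex();
            System.out.println(index.addIfNew("http://example.com/a")); // true
            System.out.println(index.addIfNew("http://example.com/a")); // false
        }
    }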
Keywords/Search Tags: Web Crawler, Hadoop, Topic Crawling, Relevance Calculation, Link Deduplication, Bloom Filter