
Investigation On Web Crawler Technology Based On Hadoop Platform

Posted on: 2018-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
Full Text: PDF
GTID: 2348330536979634
Subject: Computer system architecture

Abstract/Summary:
The rapid development of the Internet has brought explosive growth of online content, and this flood of information poses great challenges to information retrieval. Faced with such a huge volume of information and with users' personalized retrieval needs, improving the efficiency and accuracy of web information search has become a key problem that urgently needs to be solved, and web crawler technology is an important part of the solution. Because a single computer cannot accomplish a task of this scale, this thesis adopts the Hadoop cloud platform for distributed computing and storage, running improved web crawler techniques on Hadoop to crawl information efficiently and accurately.

On the basis of in-depth study of the Hadoop cloud platform and web crawler technology, the shortcomings of existing topic crawling algorithms are identified and addressed: feature extraction is optimized, the relevance calculation is improved using a semantic tree, and a weight-optimized topic crawling algorithm is proposed; the algorithm is implemented as MapReduce jobs on the cloud platform to improve the efficiency and accuracy of topic crawling. To optimize Bloom-filter-based link deduplication, a hierarchical Bloom filter tree is constructed according to link attributes, enabling fast and accurate deduplication; running it on the cloud platform improves performance as well as time and space efficiency, yielding a more effective and more accurate link deduplication algorithm.

Based on the working principles of a Hadoop-based web crawler system, the system is constructed, and its web page download module, web page parsing module, and link processing module are designed and implemented in detail, with the improved algorithms applied in the key functional modules. Experiments on the constructed system validate the improved algorithms; the results show that they are feasible and effective in improving performance and efficiency.
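The abstract does not give the semantic-tree relevance formula or the weight-optimization details, so the sketch below shows only the conventional cosine-similarity relevance score that topic crawlers commonly build on; the class name, method names, and the idea of a fetch threshold are illustrative assumptions, not the thesis's method.

    import java.util.HashMap;
    import java.util.Map;

    // Baseline topic-relevance score: cosine similarity between a page's
    // term-weight vector and a topic vector. The thesis refines relevance
    // with a semantic tree; that formula is not given in the abstract, so
    // this shows only the conventional starting point such crawlers improve.
    public class TopicRelevance {

        public static double cosine(Map<String, Double> page, Map<String, Double> topic) {
            double dot = 0, pNorm = 0, tNorm = 0;
            for (Map.Entry<String, Double> e : page.entrySet()) {
                pNorm += e.getValue() * e.getValue();
                Double w = topic.get(e.getKey());
                if (w != null) dot += e.getValue() * w;   // shared terms only
            }
            for (double w : topic.values()) tNorm += w * w;
            if (pNorm == 0 || tNorm == 0) return 0;
            return dot / (Math.sqrt(pNorm) * Math.sqrt(tNorm));
        }

        public static void main(String[] args) {
            Map<String, Double> topic = new HashMap<>();
            topic.put("hadoop", 1.0);
            topic.put("crawler", 0.8);

            Map<String, Double> page = new HashMap<>();
            page.put("hadoop", 0.5);
            page.put("mapreduce", 0.4);

            // A topic crawler would fetch pages (and expand their links)
            // only when the score clears some relevance threshold.
            System.out.printf("relevance = %.3f%n", cosine(page, topic));
        }
    }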
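Likewise, the exact structure of the hierarchical Bloom filter tree and the attribute used to partition it are not specified in the abstract. The sketch below assumes the URL's host as the partitioning attribute and keeps one small Bloom filter per host; it conveys the partitioning idea behind attribute-based deduplication without claiming to reproduce the thesis's design.

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of attribute-partitioned Bloom filters for URL
    // deduplication. Each host gets its own small filter, so a lookup only
    // probes the filter for that host instead of one large shared bit array.
    public class HostBloomIndex {

        static class BloomFilter {
            private final BitSet bits;
            private final int size;
            private final int hashes;

            BloomFilter(int size, int hashes) {
                this.bits = new BitSet(size);
                this.size = size;
                this.hashes = hashes;
            }

            // Derive k probe positions from two base hashes
            // (Kirsch-Mitzenmacher double hashing).
            private int probe(String key, int i) {
                int h1 = key.hashCode();
                int h2 = h1 >>> 16 | 1;          // force an odd second hash
                return Math.floorMod(h1 + i * h2, size);
            }

            void add(String key) {
                for (int i = 0; i < hashes; i++) bits.set(probe(key, i));
            }

            boolean mightContain(String key) {
                for (int i = 0; i < hashes; i++)
                    if (!bits.get(probe(key, i))) return false;
                return true;                      // "possibly seen"; false positives allowed
            }
        }

        private final Map<String, BloomFilter> perHost = new HashMap<>();

        private static String hostOf(String url) {
            // Crude host extraction for the sketch; real code would use java.net.URI.
            String s = url.replaceFirst("^[a-z]+://", "");
            int slash = s.indexOf('/');
            return slash < 0 ? s : s.substring(0, slash);
        }

        /** Returns true if the URL is new, and records it as seen. */
        public boolean addIfNew(String url) {
            BloomFilter f = perHost.computeIfAbsent(
                    hostOf(url), h -> new BloomFilter(1 << 20, 5));
            if (f.mightContain(url)) return false; // probably a duplicate
            f.add(url);
            return true;
        }

        public static void main(String[] args) {
            HostBloomIndex index = new HostBloomIndex();
            System.out.println(index.addIfNew("http://example.com/a")); // true
            System.out.println(index.addIfNew("http://example.com/a")); // false
        }
    }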
Keywords/Search Tags: Web Crawler, Hadoop, Topic Crawling, Relevance Calculation, Link Deduplication, Bloom Filter