Distributed Theme Crawler Based On Hadoop And Its Implementation

Posted on:2016-09-12

Degree:Master

Type:Thesis

Country:China

Candidate:Y F Ren

GTID:2208330473961447

Subject:Computer technology

Abstract/Summary:

With the widespread use of the Internet, the volume of online information is growing exponentially, and so is the demand for access to it. How to use existing network resources to deliver useful information to users has become an urgent problem. A focused crawler is one effective way to address this problem, and advances in cloud computing make it possible to improve crawler efficiency. Hadoop, developed by Apache, is a user-friendly open cloud computing framework; the main objective of this research is to design and implement a theme crawler system on it. The main work is as follows:

(1) The thesis reviews the relevant Hadoop background, including the MapReduce computation model and the HDFS distributed file system, and then discusses the framework, workflow, and characteristics of focused crawlers. To obtain more specialized and accurate topical information, it studies the key technologies of topic crawlers, such as relevance judgment, page text extraction, and hyperlink extraction. Building on existing academic work, it makes some improvements to topic relevance judgment, so that the system locates and retrieves information more precisely and the extracted data better matches actual needs.

(2) A focused crawler system based on Hadoop is designed, and its workflow and overall framework are described in detail. To support later information processing and indexing, a content extraction module filters the crawled pages in batches and extracts the required page text, so that the information becomes structured.

(3) The overall architecture of the system and the implementation of each module are presented, including the data storage structure, the division of the system into functional modules, and each module's Map/Reduce implementation.

(4) Analysis of the experimental results shows that all modules of the theme crawler run well and that the system collects topical information with high accuracy. Compared with a stand-alone system, it collects data more efficiently, and its flexibility and extensibility are greatly improved.
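As an illustration of how relevance judgment and hyperlink extraction could fit a Map/Reduce step, here is a minimal single-process sketch in Python. It is not the thesis's implementation: the abstract does not say which relevance measure is used, so cosine similarity over term counts is an assumption, and the names `map_page`, `reduce_link`, `run_job`, and the `threshold` parameter are hypothetical. The `run_job` function only simulates the shuffle phase in memory; on Hadoop the grouping would be done by the framework.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


def term_vector(text):
    """Bag-of-words term counts over lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LinkAndTextParser(HTMLParser):
    """Collects href targets of <a> tags and all text nodes of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def map_page(url, html, topic_vector, threshold):
    """Map step (hypothetical): score a page, emit (outlink, score) if relevant."""
    parser = LinkAndTextParser()
    parser.feed(html)
    page_vector = term_vector(" ".join(parser.text_parts))
    score = cosine_similarity(page_vector, topic_vector)
    if score >= threshold:
        for link in parser.links:
            yield link, score


def reduce_link(link, scores):
    """Reduce step (hypothetical): deduplicate a link, keep its best parent score."""
    return link, max(scores)


def run_job(pages, topic, threshold=0.2):
    """In-memory stand-in for map, shuffle, and reduce over {url: html}."""
    topic_vector = term_vector(topic)
    grouped = defaultdict(list)
    for url, html in pages.items():
        for link, score in map_page(url, html, topic_vector, threshold):
            grouped[link].append(score)
    return dict(reduce_link(link, scores) for link, scores in grouped.items())
```

Calling `run_job` with a dictionary of fetched pages and a topic description returns the candidate links for the next crawl round, each with the highest relevance score among the pages that linked to it. The threshold value of 0.2 is arbitrary and would need tuning on real data.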

Keywords/Search Tags:

theme crawler, Hadoop, topic similarity

Related items

1	The Design And Implementation Of The Topic-focused Web Crawler System
2	Research On Topic Focused Web Crawler And Related Technologies
3	Design And Implementation Of The Theme Crawler For Procurement Clues In The Automotive Field
4	Research And Implementation Of Multithreading Web Crawler Based On Theme
5	Research And Design Of Topic Crawler Through Tunnels Algorithm
6	Investigation On Web Crawler Technology Based On Hadoop Platform
7	Stock Research Engine Based On Theme Crawler
8	Research On Key Technology Of Subject Network Crawler
9	Research On Theme Crawler Based On Educational Information Resource Ontology
10	Research And Implementation On Theme Web Crawler Of Supporting Ajax