
Research on an Internet Information Collection System Based on a Cloud-Platform Web Crawler

Posted on: 2013-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Mu
Full Text: PDF
GTID: 2298330467978141
Subject: Computer software and theory
Abstract/Summary:
As the volume of online information keeps growing, a series of new technologies has emerged rapidly. Cloud computing is one of them; it has become a mainstream research direction and an important trend for the future. Many large companies, such as Google, IBM, and Amazon, have joined in cloud computing research and development, and its application areas continue to expand. To facilitate learning and research, our laboratory has drawn up its own cloud computing research program to build a "cloud platform" on which different applications can be designed. The platform is based on Hadoop, Apache's open-source cloud computing framework, which is convenient to use and mature; this thesis uses Hadoop to build the cloud platform.

The main work of this thesis is the design of an Internet information collection system for news, based on a cloud-platform web crawler and implemented on the Hadoop distributed platform. The system consists of two main modules: an information collection module and an information extraction module. For the collection module, the thesis designs and implements a web crawler on the cloud platform, adopting the Map/Reduce programming model. For the extraction module, it uses an approach based on HTML structural analysis. The design involves several key technologies: cloud computing, the Hadoop distributed platform, web crawlers, and web information extraction. The thesis introduces and analyzes these technologies in turn. The cloud computing section covers its background, the current mainstream cloud platforms, and their characteristics. The Hadoop section introduces its two core technologies, HDFS and Map/Reduce. Since web crawling and information extraction are mature technologies, the thesis only briefly reviews their core principles.

The significance of this thesis lies in integrating theory with practice by applying cloud computing theory to an actual development project. The implemented Internet information collection system, based on a cloud-platform web crawler, addresses shortcomings of existing systems in cost, scalability, and efficiency, and provides a collection system that is simple, flexible, secure, reliable, and highly scalable.
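To make the Map/Reduce design of the collection module concrete, the following is a minimal Hadoop sketch of one fetch round of a distributed crawler: a map-only job reads a URL list from HDFS, downloads each page, and writes (URL, HTML) pairs back to HDFS. The class names, the plain-text URL input format, and the single-line output encoding are illustrative assumptions, not the thesis's actual implementation.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FetchJob {

    // Map-only fetch round: every input line is one URL taken from the
    // crawl frontier stored on HDFS; the mapper downloads the page and
    // emits (URL, raw HTML) so a later job can parse and extract it.
    public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String url = value.toString().trim();
            if (url.isEmpty()) {
                return;
            }
            try (InputStream in = new URL(url).openStream()) {
                String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                // Collapse whitespace so the page fits on one output line.
                context.write(new Text(url), new Text(html.replaceAll("\\s+", " ")));
            } catch (IOException e) {
                // Count failed fetches instead of failing the whole task.
                context.getCounter("crawler", "fetch_failures").increment(1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fetch-round");
        job.setJarByClass(FetchJob.class);
        job.setMapperClass(FetchMapper.class);
        job.setNumReduceTasks(0);                               // no reduce phase needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // URL list on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // fetched pages
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The extraction module's HTML structural analysis can be sketched in a similar spirit. The example below uses the jsoup parser to pull a headline and article body out of a fetched page; the CSS selectors "h1.title" and "div.article-body" are placeholders that would be replaced by rules derived from each news site's page structure, and jsoup itself is only one possible parser choice, not necessarily the one used in the thesis.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NewsExtractor {

    // Returns {title, body} extracted from one fetched news page.
    public static String[] extract(String html) {
        Document doc = Jsoup.parse(html);
        // Placeholder selectors: a real deployment needs per-site rules
        // based on the structural analysis of each page template.
        Element title = doc.selectFirst("h1.title");
        Element body  = doc.selectFirst("div.article-body");
        return new String[] {
            title != null ? title.text() : "",
            body  != null ? body.text()  : ""
        };
    }
}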
Keywords/Search Tags: cloud computing, information collection, Hadoop, web crawler, information extraction