
Research on an Internet Information Collection System Based on a Cloud-Platform Web Crawler

Posted on: 2013-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Mu
Full Text: PDF
GTID: 2298330467978141
Subject: Computer software and theory
Abstract/Summary:
As the volume of online information keeps growing, a series of new technologies has emerged rapidly. Cloud computing is one of them; it has become a mainstream research direction and an important trend for the future. Many large companies, such as Google, IBM, and Amazon, have joined in cloud computing research and development, and its application areas continue to expand. To facilitate learning and research, our laboratory has drawn up its own cloud computing research program to build a "cloud platform" on which different applications can be designed. The platform is based on Hadoop, Apache's open-source cloud computing framework, which is convenient to use and mature; this thesis uses Hadoop to build the cloud platform.

The main work of this thesis is the design of an Internet information collection system for news, based on a cloud-platform web crawler and implemented on the Hadoop distributed platform. The system consists of two main modules: an information collection module and an information extraction module. For the collection module, the thesis designs and implements a web crawler on the cloud platform, adopting the Map/Reduce programming model. For the extraction module, it uses an approach based on HTML structural analysis. The design involves several key technologies: cloud computing, the Hadoop distributed platform, web crawlers, and web information extraction. The thesis introduces and analyzes these technologies in turn. The cloud computing section covers its background, the current mainstream cloud platforms, and their characteristics. The Hadoop section introduces its two core technologies, HDFS and Map/Reduce. Since web crawling and information extraction are mature technologies, the thesis only briefly reviews their core principles.

The significance of this thesis lies in integrating theory with practice by applying cloud computing theory to an actual development project. The implemented Internet information collection system, based on a cloud-platform web crawler, addresses shortcomings of existing systems in cost, scalability, and efficiency, and provides a collection system that is simple, flexible, secure, reliable, and highly scalable.
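To make the Map/Reduce design of the collection module concrete, the following is a minimal Hadoop sketch of one fetch round of a distributed crawler: a map-only job reads a URL list from HDFS, downloads each page, and writes (URL, HTML) pairs back to HDFS. The class names, the plain-text URL input format, and the single-line output encoding are illustrative assumptions, not the thesis's actual implementation.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FetchJob {

    // Map-only fetch round: every input line is one URL taken from the
    // crawl frontier stored on HDFS; the mapper downloads the page and
    // emits (URL, raw HTML) so a later job can parse and extract it.
    public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String url = value.toString().trim();
            if (url.isEmpty()) {
                return;
            }
            try (InputStream in = new URL(url).openStream()) {
                String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                // Collapse whitespace so the page fits on one output line.
                context.write(new Text(url), new Text(html.replaceAll("\\s+", " ")));
            } catch (IOException e) {
                // Count failed fetches instead of failing the whole task.
                context.getCounter("crawler", "fetch_failures").increment(1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fetch-round");
        job.setJarByClass(FetchJob.class);
        job.setMapperClass(FetchMapper.class);
        job.setNumReduceTasks(0);                               // no reduce phase needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // URL list on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // fetched pages
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The extraction module's HTML structural analysis can be sketched in a similar spirit. The example below uses the jsoup parser to pull a headline and article body out of a fetched page; the CSS selectors "h1.title" and "div.article-body" are placeholders that would be replaced by rules derived from each news site's page structure, and jsoup itself is only one possible parser choice, not necessarily the one used in the thesis.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NewsExtractor {

    // Returns {title, body} extracted from one fetched news page.
    public static String[] extract(String html) {
        Document doc = Jsoup.parse(html);
        // Placeholder selectors: a real deployment needs per-site rules
        // based on the structural analysis of each page template.
        Element title = doc.selectFirst("h1.title");
        Element body  = doc.selectFirst("div.article-body");
        return new String[] {
            title != null ? title.text() : "",
            body  != null ? body.text()  : ""
        };
    }
}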
Keywords/Search Tags: cloud computing, information collection, Hadoop, web crawler, information extraction