
Design and Implementation of the Crawler Log Data Information Extraction and Statistical System

Posted on: 2013-11-21    Degree: Master    Type: Thesis
Country: China    Candidate: G L Wang    Full Text: PDF
GTID: 2248330374999135    Subject: Software engineering
Abstract/Summary:
With the expansion of information on the network, people depend on search engines to a large and growing extent. As an integral part of a search engine, the quality of the web pages fetched by the crawler directly affects the quality of the search results. Even if retrieval and indexing are done perfectly, the user experience suffers when most of what the crawler downloads is garbage pages. The crawler's scheduling and fetching strategy therefore needs to be adjusted according to the fetching effect. How, then, can we evaluate the quality and effect of the pages fetched by the crawler? That is the problem addressed by the crawler log data information extraction and statistical system presented in this thesis.

The work of this thesis is as follows:

1. During seed merging, scheduling, and web page downloading, the crawler writes logs that are distributed across the nodes of the crawler cluster. This thesis describes how the crawler log data are collected and compressed, how the compressed files are uploaded to the distributed file storage system HDFS, and how index files are then produced.

2. For a distributed crawler cluster whose daily download volume is controlled between 800 million and several billion URLs, the crawler logs occupy hundreds of gigabytes, and the compressed files uploaded to HDFS amount to roughly 150 GB per day. A single machine cannot process data of this volume, so this thesis applies information extraction techniques on Hadoop as the computing platform, using Hive to turn the crawler log data into structured form. The statistical indexes of interest for the crawler are expressed as HQL statements, which Hive translates into jobs submitted to the Hadoop cluster for processing; the MapReduce results are finally imported into a MySQL database (an illustrative HQL sketch follows below).

3. Finally, the lightweight PHP framework CodeIgniter (CI) is used to display the report pages and send report mail; the pages and mail contain the crawler index data stored in the MySQL database.

This thesis takes the crawler log data as its data source. The experimental results show that a mass data processing platform built on Hadoop and Hive can complete effective information mining within the available time and provide reliable data support for adjusting the crawler strategy.
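To make the Hive-based statistics step concrete, the following is a minimal illustrative sketch, not taken from the thesis: the table name, column names, HDFS paths, and the example date are assumptions. It declares an external Hive table over the compressed crawler logs already uploaded to HDFS and expresses one example statistical index as an HQL statement, which Hive compiles into MapReduce jobs on the Hadoop cluster.

-- Hypothetical external table over the crawler logs on HDFS;
-- the schema and location are illustrative assumptions, not from the thesis.
CREATE EXTERNAL TABLE IF NOT EXISTS crawler_log (
  log_time    STRING,
  node_id     STRING,
  url         STRING,
  http_status INT,
  page_type   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/crawler/logs/';

-- Example statistical index: pages downloaded per node and per HTTP status for one day.
-- Hive translates this HQL statement into MapReduce jobs executed on the cluster.
INSERT OVERWRITE DIRECTORY '/data/crawler/stats/2012-12-01'
SELECT node_id, http_status, COUNT(*) AS page_cnt
FROM crawler_log
WHERE to_date(log_time) = '2012-12-01'
GROUP BY node_id, http_status;

The result files written by such a job could then be loaded into the MySQL database (for example via a Sqoop export or a LOAD DATA INFILE statement) so that the CodeIgniter pages and report mail can read the crawler index data, matching the flow described in the abstract.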
Keywords/Search Tags:Information extraction, Crawler index data statistics, Hadoop, Hive