
Design and Implementation of the Crawler Log Data Information Extraction and Statistical System

Posted on: 2013-11-21    Degree: Master    Type: Thesis
Country: China    Candidate: G L Wang    Full Text: PDF
GTID: 2248330374999135    Subject: Software engineering
Abstract/Summary:
With the expansion of information on the network, people depend on search engines to a large and growing extent. As an integral part of a search engine, the quality of the web pages fetched by the crawler directly affects the quality of the search results. Even if retrieval and indexing are done perfectly, the user experience suffers when most of what the crawler downloads is garbage pages. The crawler's scheduling and fetching strategy therefore needs to be adjusted according to the fetching effect. How, then, can we evaluate the quality and effect of the pages fetched by the crawler? That is the problem addressed by the crawler log data information extraction and statistical system presented in this thesis.

The work of this thesis is as follows:

1. During seed merging, scheduling, and web page downloading, the crawler writes logs that are distributed across the nodes of the crawler cluster. This thesis describes how the crawler log data are collected and compressed, how the compressed files are uploaded to the distributed file storage system HDFS, and how index files are then produced.

2. For a distributed crawler cluster whose daily download volume is controlled between 800 million and several billion URLs, the crawler logs occupy hundreds of gigabytes, and the compressed files uploaded to HDFS amount to roughly 150 GB per day. A single machine cannot process data of this volume, so this thesis applies information extraction techniques on Hadoop as the computing platform, using Hive to turn the crawler log data into structured form. The statistical indexes of interest for the crawler are expressed as HQL statements, which Hive translates into jobs submitted to the Hadoop cluster for processing; the MapReduce results are finally imported into a MySQL database (an illustrative HQL sketch follows below).

3. Finally, the lightweight PHP framework CodeIgniter (CI) is used to display the report pages and send report mail; the pages and mail contain the crawler index data stored in the MySQL database.

This thesis takes the crawler log data as its data source. The experimental results show that a mass data processing platform built on Hadoop and Hive can complete effective information mining within the available time and provide reliable data support for adjusting the crawler strategy.
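To make the Hive-based statistics step concrete, the following is a minimal illustrative sketch, not taken from the thesis: the table name, column names, HDFS paths, and the example date are assumptions. It declares an external Hive table over the compressed crawler logs already uploaded to HDFS and expresses one example statistical index as an HQL statement, which Hive compiles into MapReduce jobs on the Hadoop cluster.

-- Hypothetical external table over the crawler logs on HDFS;
-- the schema and location are illustrative assumptions, not from the thesis.
CREATE EXTERNAL TABLE IF NOT EXISTS crawler_log (
  log_time    STRING,
  node_id     STRING,
  url         STRING,
  http_status INT,
  page_type   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/crawler/logs/';

-- Example statistical index: pages downloaded per node and per HTTP status for one day.
-- Hive translates this HQL statement into MapReduce jobs executed on the cluster.
INSERT OVERWRITE DIRECTORY '/data/crawler/stats/2012-12-01'
SELECT node_id, http_status, COUNT(*) AS page_cnt
FROM crawler_log
WHERE to_date(log_time) = '2012-12-01'
GROUP BY node_id, http_status;

The result files written by such a job could then be loaded into the MySQL database (for example via a Sqoop export or a LOAD DATA INFILE statement) so that the CodeIgniter pages and report mail can read the crawler index data, matching the flow described in the abstract.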
Keywords/Search Tags:Information extraction, Crawler index data statistics, Hadoop, Hive