Design And Implementation Of Massive Web Log Analysis System Based On Hadoop/Hive

Posted on:2012-09-03

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Liu

Full Text:PDF

GTID:2218330368987761

Subject:Computer application technology

Abstract/Summary:

Web log processing has long been an active research topic. With the rapid development of Internet technology, the volume of information generated on the network keeps growing, and web log processing faces new problems. A data center produces not only massive amounts of web log data but also log files in many different formats. How to store and process the massive, heterogeneous web logs generated by a data center is the main subject of this thesis.

Hadoop is a popular framework for large-scale data processing. It runs on multiple platforms and offers good robustness and scalability. Hadoop implements the MapReduce programming model, and users must write MapReduce programs specific to their tasks. Because MapReduce programs are relatively low level, completing a specific task requires a lot of code. Hive is an open-source data warehouse tool built on Hadoop. It introduces some concepts from traditional databases and supports an SQL-like language, so users familiar with traditional database development can develop quickly and the amount of code is reduced significantly.

This thesis studies these two tools in depth, including their associated concepts and technologies. The study also covers their use: how to configure an environment based on Hadoop/Hive, how to maintain a cluster composed of Hadoop and Hive, and how to develop on a Hadoop/Hive platform, for example how to write MapReduce programs and how to solve data processing problems with the SQL-like language that Hive provides.

Based on this study, the thesis designs and implements a web log analysis system on Hadoop/Hive. The system is logically divided into four functional modules. The log data collecting module synchronizes the web log data generated by the various front-end web sites to the log collecting site and then runs background scripts to import the data into previously created tables. The query analysis module preprocesses the web logs, receives query requests and returns query results. The storing and processing module handles the actual storage of data, including the original data, the cleaned data and various temporary data. In the results output module, a programming language is chosen to communicate with Hive, implement the statistics code, and finally present the results as web pages. The system makes full use of Hadoop's data processing capability and of the simpler application development that Hive offers. It has a clear advantage in big data processing and high practical value.
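The contrast the abstract draws between hand-written MapReduce code and an SQL-like query can be illustrated with a small sketch. It is not taken from the thesis: the two log formats, the function names and the table name `logs` are assumptions made for illustration, and the map and reduce phases are simulated in plain Python rather than run on Hadoop.

```python
import re
from collections import defaultdict

# Assumed format 1: Common Log Format style lines from a web server.
COMMON_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def clean(line):
    """Normalize one raw log line to (ip, url, status), or None if unparseable.

    Assumed format 2 is a hypothetical tab-separated layout:
    time, ip, url, status.
    """
    match = COMMON_LOG.match(line)
    if match:
        return match.group("ip"), match.group("url"), int(match.group("status"))
    parts = line.rstrip("\n").split("\t")
    if len(parts) == 4 and parts[3].isdigit():
        return parts[1], parts[2], int(parts[3])
    return None


def map_phase(lines):
    """Emit (url, 1) for every successfully served request."""
    for line in lines:
        record = clean(line)
        if record is not None and record[2] == 200:
            yield record[1], 1


def reduce_phase(pairs):
    """Sum the emitted values per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def hits_per_url(lines):
    """Roughly what a Hive user could express in one query, for example:
    SELECT url, COUNT(*) FROM logs WHERE status = 200 GROUP BY url;
    (the table `logs` and its columns are hypothetical)
    """
    return reduce_phase(map_phase(lines))
```

Calling `hits_per_url` on a list of raw lines returns a dictionary mapping each URL to its number of successful requests. The cleaning step stands in for the preprocessing of heterogeneous logs, and the two phases stand in for the code a user would otherwise write by hand against the MapReduce API.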

Keywords/Search Tags:

web log, cloud computing, Hadoop, Hive

Related items

1	Research And Design Of User Network Behavior Analysis And Mining System Based On Clouding Computing
2	The Research And Practice Of Performance Optimization Based On Hive
3	Querying Relational Database Based On Hadoop Platform And Its Implementation
4	Research On Data Management And Parallel Docking In Virtual Screening Based On Hadoop
5	Research And Implementation Of Cloud Computing-Based General Platform For Processing Data Of Internet Of Things
6	Design And Implementation Of Contextual Marketing Based On Distributed Computing Hive And Data Mining
7	The Research On Data Processing Cloud Platform Of Large Open Style Wharf Mooring Monitoring System
8	The Research About Stream Media Surveillance System Based On Cloud Computing
9	Design And Implementation Of The Online Shopping System Based On Hadoop Cloud Computing Framework
10	Research On The Key Technologies Of Cloud Computing Platform Hadoop