Design And Implementation Of Massive Web Log Analysis System Based On Hadoop/Hive

Posted on: 2012-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Liu
Full Text: PDF
GTID: 2218330368987761
Subject: Computer application technology
Abstract/Summary:
Web log processing has long been an active research topic. With the rapid development of Internet technology, the volume of information generated on the network keeps growing, and web log processing faces new problems. A data center not only produces massive amounts of web log data but also generates log files in many different formats. How to store and process the massive, heterogeneous web logs generated by a data center is the main subject of this thesis.

Hadoop is a popular framework for large-scale data processing. It runs on multiple platforms and offers good robustness and scalability. Hadoop implements the MapReduce programming model, and users have to write MapReduce programs specific to their tasks. Because MapReduce programs sit at a relatively low level, users must write a lot of code to complete a given task. Hive is an open-source data warehouse tool built on Hadoop. It introduces some concepts from traditional databases and supports an SQL-like language, so users familiar with traditional database development can work quickly and the amount of code is reduced significantly.

This thesis studies these two tools in depth, including their associated concepts and technologies. The study also covers their use: how to configure a Hadoop/Hive environment, how to maintain a cluster composed of Hadoop and Hive, and how to develop on a Hadoop/Hive platform, for example how to write MapReduce programs and how to solve data processing problems with the SQL-like language that Hive provides.

Based on this study, the thesis designs and implements a web log analysis system built on Hadoop/Hive. The system is logically divided into four functional modules. The log data collection module synchronizes the web log data generated by the various front-end web sites to the log collection site and then runs background scripts to import the data into previously created tables. The query analysis module preprocesses the web logs, receives query requests, and returns query results. The storage and processing module handles the actual storage of data, including the original data, the cleaned data, and various temporary data. The result output module uses a programming language that can communicate with Hive to implement the statistics code, and finally presents the results as web pages. The system makes full use of Hadoop's data processing capability and of Hive's advantage in simplifying application development. It has a clear advantage in big data processing and high practical value.
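The MapReduce model mentioned in the abstract can be illustrated with a small in-process sketch that counts requests per URL path. This is not the thesis's implementation: the input is assumed to be in Apache Common Log Format (the abstract does not name its log formats), and `map_line`, `shuffle`, `reduce_counts`, and `count_hits` are hypothetical names chosen for illustration. The code runs on a single machine and only mimics the map, shuffle, and reduce phases that Hadoop would distribute across a cluster.

```python
import re
from collections import defaultdict

# Assumed input format: Apache Common Log Format. The abstract does not
# specify its log formats, so this pattern is illustrative only.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def map_line(line):
    """Map step: emit (path, 1) for a parseable line; skip malformed lines."""
    match = LOG_PATTERN.match(line)
    if match:
        yield match.group("path"), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def count_hits(lines):
    """Run map, shuffle, and reduce in-process over an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [03/Sep/2012:10:00:00 +0800] "GET /index.html HTTP/1.1" 200 1024',
        '10.0.0.2 - - [03/Sep/2012:10:00:01 +0800] "GET /about.html HTTP/1.1" 200 512',
        '10.0.0.1 - - [03/Sep/2012:10:00:02 +0800] "GET /index.html HTTP/1.1" 304 0',
        'not a log line',
    ]
    print(count_hits(sample))
```

Even for this simple count, the parsing, grouping, and summing steps all have to be written by hand. In an SQL-like language such as the one Hive provides, the same aggregation would typically be a single grouped query, which is the reduction in code that the abstract attributes to Hive.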
Keywords/Search Tags: web log, cloud computing, Hadoop, Hive