
Research On Technologies Of Log Big Data Analysis Platform

Posted on: 2016-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: H J Zhou
Full Text: PDF
GTID: 2308330461984202
Subject: Computer application technology

Abstract/Summary:
The Internet is developing at an ever-faster pace, and the log data it generates is growing just as rapidly. Web logs are large in volume, varied, heterogeneous, and dynamic. For Internet companies, processing this log data in a timely and effective way has become a major challenge: the traditional single-node, centralized processing model cannot cope with data at this scale. At the same time, the demand for log analysis keeps growing and changing, so log analysis engineers are constantly beset by new requirements, and companies must invest considerable time to keep up with them.

This thesis studies distributed storage, computation, and scheduling, proposes a processing workflow for large-scale log data, and uses Hadoop and Hive to build a log big data analysis platform. The proposed workflow consists of four stages: log collection, preprocessing, storage, and analysis. The collection stage gathers log data from distributed web servers so that it can be managed and used conveniently. The preprocessing stage cleans and transforms the unformatted logs. The storage stage keeps the log data in Hadoop and Hive. Building on these three stages, the analysis stage queries the log data with Hive SQL.

The platform models the collection, preprocessing, and analysis stages by abstracting each of them into a task: a collecting task, a preprocessing task, and an analysis task. It offers users interfaces for configuring these tasks, so the platform can be used simply by configuring the three tasks; the platform itself is responsible for scheduling and running them. Users only need to wait for the results and do not need to know how the platform schedules and executes the tasks. To schedule and run the tasks efficiently, this thesis presents a framework called the task scheduling and running engine. The engine is built on a traditional master-slave architecture, implements a static priority scheduling algorithm with a failure-handling mechanism, and executes tasks in a distributed, parallel manner. The last chapter verifies the practicability and efficiency of the platform through an experiment.
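
To make the preprocessing stage concrete, the following minimal Python sketch cleans and transforms unformatted web-server log lines into tab-separated records suitable for loading into Hive. The log format, field names, and file names are assumptions for illustration; the abstract does not specify them.

    import re

    # Assumed Apache-style log format; the actual format handled by the
    # platform is not specified in the abstract.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\S+)'
    )

    def clean_line(raw_line):
        """Parse one raw log line; return a tab-separated record, or None if malformed."""
        match = LOG_PATTERN.match(raw_line)
        if match is None:
            return None  # cleaning: drop lines that do not parse
        fields = match.groupdict()
        if not fields['bytes'].isdigit():
            fields['bytes'] = '0'  # transforming: normalize '-' to a number
        return '\t'.join(fields[k] for k in
                         ('ip', 'time', 'method', 'url', 'status', 'bytes'))

    with open('access.log') as src, open('access_clean.tsv', 'w') as dst:
        for line in src:
            record = clean_line(line)
            if record is not None:
                dst.write(record + '\n')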
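The storage and analysis stages can be sketched in the same way. The snippet below loads the cleaned records into a Hive table and runs an example Hive SQL query. The PyHive client library, the HiveServer2 address, the table schema, and the particular query are all assumptions made only to keep the example self-contained; the thesis does not state which analyses the platform runs.

    from pyhive import hive

    conn = hive.connect(host='localhost', port=10000)  # assumed HiveServer2 address
    cursor = conn.cursor()

    # Storage stage: expose the cleaned, tab-separated logs as a Hive table.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS web_log (
            ip STRING, time STRING, method STRING,
            url STRING, status INT, bytes BIGINT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    """)
    cursor.execute("LOAD DATA LOCAL INPATH 'access_clean.tsv' INTO TABLE web_log")

    # Analysis stage: an illustrative Hive SQL query over the stored logs.
    cursor.execute("""
        SELECT url, COUNT(*) AS hits
        FROM web_log
        GROUP BY url
        ORDER BY hits DESC
        LIMIT 10
    """)
    for url, hits in cursor.fetchall():
        print(url, hits)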
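Finally, a minimal single-process sketch of static priority scheduling with failure handling, of the kind the task scheduling and running engine is described as implementing. The real engine is distributed and master-slave; here threads stand in for slave nodes, the priorities and task names are hypothetical, and inter-task dependencies (collection before preprocessing before analysis) are not modeled because the abstract does not describe how the engine handles them.

    import heapq
    import threading

    class Task:
        def __init__(self, name, priority, action):
            self.name = name
            self.priority = priority  # static: fixed when the task is created
            self.action = action

    ready = []                 # master's ready queue (min-heap: lower value = higher priority)
    lock = threading.Lock()

    def submit(task):
        with lock:
            heapq.heappush(ready, (task.priority, id(task), task))

    def worker():
        """Stand-in for a slave node: repeatedly run the highest-priority task."""
        while True:
            with lock:
                if not ready:
                    return
                _, _, task = heapq.heappop(ready)
            try:
                task.action()
            except Exception:
                # Failure mechanism: re-enqueue the failed task so the
                # master reschedules it on another worker.
                submit(task)

    submit(Task('collect', 0, lambda: print('collecting logs')))
    submit(Task('preprocess', 1, lambda: print('cleaning logs')))
    submit(Task('analyze', 2, lambda: print('running Hive SQL')))

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()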
Keywords/Search Tags: web log, big data analysis, distributed computation, Hadoop, Hive