
The Design And Implementation Of Log Analysis System Based On Spark

Posted on: 2015-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: J. H. Liu
GTID: 2308330461455044
Subject: Software engineering

Abstract:
Currently, Internet applications have penetrated enterprise office systems, and enterprise business increasingly depends on the Internet. Transmitting information over the network reduces costs and improves office efficiency. However, along with this convenience, enterprise employees visit work-unrelated websites during working hours, which harms both the business and the network environment. Consequently, enterprises need an audit system that records users' network access behavior; the records the system produces are stored regularly and accurately in the form of text logs.

With the growth of Internet enterprises and the expansion of application scale, a log analysis system running on a single machine can no longer meet current demands. As a result, a massive-data processing cluster becomes the ideal platform for log analysis. The original massive-data processing framework was proposed by Google in 2003-2006; afterwards a similar framework, Hadoop, was born as a distributed computing framework, and its massive-data processing performance excelled in the Internet industry. Nevertheless, the Hadoop framework alone is not enough to support real-time analysis and iterative computing scenarios. Therefore, after 2009 many enterprises proposed improved computation frameworks in succession, such as Dremel and Spark.

Based on the above situation, an extensive literature review, and enterprises' common demand to observe user behavior, this paper designs a massive log data analysis platform based on Spark. The platform is designed as four modules: log collection, logic processing, webpage display, and task management; the access.log of the Squid server is used as the platform's data source.
The four modules implement, respectively, the collection and import of data, the analysis and processing of data, a client display for user operation and result presentation, and the monitoring and management of the cluster. Compared with Hadoop, Spark brings a substantial performance improvement through in-memory computing.
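The abstract does not include code, but the kind of aggregation the logic-processing module performs over Squid's access.log can be illustrated with a minimal sketch. The plain-Python example below (the function names and sample log lines are hypothetical; the field layout assumes Squid's native access.log format: timestamp, duration, client IP, result code/status, bytes, method, URL, ...) counts visits per client and site — in the actual system such an aggregation would run distributed on Spark over the full log.

```python
from collections import Counter
from urllib.parse import urlsplit

def parse_squid_line(line):
    """Parse one line of Squid's native access.log format.

    Native fields: timestamp, duration(ms), client IP, result/status,
    bytes, method, URL, ident, hierarchy/peer, content type.
    """
    fields = line.split()
    if len(fields) < 7:
        return None  # skip malformed lines
    return {
        "timestamp": float(fields[0]),
        "client": fields[2],
        "status": fields[3],
        "url": fields[6],
    }

def top_sites_per_client(lines):
    """Count visits per (client IP, site host) pair."""
    counts = Counter()
    for line in lines:
        rec = parse_squid_line(line)
        if rec is None:
            continue
        host = urlsplit(rec["url"]).netloc or rec["url"]
        counts[(rec["client"], host)] += 1
    return counts

# Hypothetical sample lines in Squid's native access.log layout.
sample = [
    "1420070400.123 200 10.0.0.5 TCP_MISS/200 1024 GET http://news.example.com/a - DIRECT/93.184.216.34 text/html",
    "1420070401.456 150 10.0.0.5 TCP_HIT/200 512 GET http://news.example.com/b - NONE/- text/html",
    "1420070402.789 300 10.0.0.7 TCP_MISS/200 2048 GET http://work.example.org/ - DIRECT/203.0.113.9 text/html",
]
print(top_sites_per_client(sample))
```

The same per-key counting maps directly onto Spark's model (e.g. a map to `((client, host), 1)` pairs followed by a reduce by key), which is where the cluster and in-memory computing pay off on massive logs.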
Keywords: Spark, Shark, Resilient Distributed Datasets, log analysis