Optimizing Big Data Equi-join In Spark And Its Application In Analysis Of Network Traffic Data

Posted on:2016-08-25

Degree:Master

Type:Thesis

Country:China

Candidate:S W Zhou

Full Text:PDF

GTID:2308330479493949

Subject:Computer application technology

Abstract/Summary:

With the development of Internet technology and cloud computing, data is growing in every industry, and the era of big data has arrived. In the telecom field, the network traffic data used to monitor the Internet is also growing rapidly, and storing and analyzing it requires big data technology. Hadoop was introduced for this purpose: HDFS and MapReduce, two important components of the Hadoop ecosystem, meet the demands of network traffic data to a certain extent. But MapReduce is a batch processing model and cannot serve other analysis needs such as interactive queries and streaming data processing. As a result, several distributed computing frameworks, such as MapReduce, Impala and Storm, have to be used side by side. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. The analysis of network traffic data involves many joins between big tables, and the join operation in Spark performs poorly on them: many records that do not satisfy the join condition still enter the shuffle stage, which causes heavy network communication and I/O overhead. First, for big-table join tasks, this thesis presents a new Spark join method based on a Bloom filter, which filters out many records that do not satisfy the join condition before the shuffle. Second, it proposes a new network traffic big data platform based on the improved Spark and describes how Spark and the components of its ecosystem are used to process network traffic data. Finally, two experiments verify this work: the new join method reduces network communication overhead and shortens the shuffle phase, and Spark shows better performance than MapReduce in processing network traffic data.
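The abstract does not give the details of the Bloom-filter join, so the following is only a minimal sketch of the general idea in plain Python, without Spark. It assumes the filter is built from the join keys of one table and applied to the other table before records would be shuffled. The class `BloomFilter`, the function `bloom_filtered_join`, and the parameters `num_bits` and `num_hashes` are hypothetical names chosen for illustration, not the thesis's implementation.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter over string keys (not the thesis's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def bloom_filtered_join(left, right, num_bits=1 << 16, num_hashes=4):
    """Equi-join two lists of (key, value) pairs.

    Returns the joined rows and the number of `right` records that
    survive the pre-filter, i.e. the records that would still be shuffled.
    """
    # Step 1: build the filter from the join keys of one table.
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in left:
        bloom.add(key)

    # Step 2: drop records of the other table whose key cannot match.
    # A Bloom filter has no false negatives, so no matching record is lost.
    survivors = [(k, v) for k, v in right if bloom.might_contain(k)]

    # Step 3: ordinary hash join on the reduced input; false positives
    # from the filter are discarded here.
    index = {}
    for k, v in left:
        index.setdefault(k, []).append(v)
    joined = [(k, lv, rv) for k, rv in survivors for lv in index.get(k, [])]
    return joined, len(survivors)


if __name__ == "__main__":
    # Hypothetical data: a small host table and a larger flow table.
    hosts = [("10.0.0.%d" % i, "host%d" % i) for i in range(50)]
    flows = [("10.0.%d.%d" % (i % 4, i % 100), i) for i in range(1000)]
    rows, kept = bloom_filtered_join(hosts, flows)
    print(len(flows), "flow records before filtering,", kept, "after")
```

In a Spark job the same idea would presumably mean building the filter on one side, distributing it to the executors, and filtering the other side before the shuffle. The bit-array size and the number of hash functions trade memory against the false-positive rate, which determines how many non-matching records still reach the shuffle.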

Keywords/Search Tags:

Spark, HDFS, BloomFilter, Join, NetFlow

Related items

1	Research And Implementation Of Economic Dynamic Management System Based On Spark Technology
2	Research On Query Analysis And Optimization Based On Spark System
3	Implementation And Optimization For Join Operation In Spark
4	Optimization Scheme And Implementation Of Join Operation In Spark Computing Engine
5	Research On Cardinalities Estimation Of Two Table For Join Operator Based On Spark SQL Platform
6	Query Optimization In Spark SQL For Business Data Of 4G Industry Card Based On HDFS
7	Research On Equi-Join Optimization Algorithms On Spark SQL
8	A Malicious Flow Monitoring And Analysis System Based On Spark Platform
9	Implementation And Evaluation Of Big Data Parallel Join Algorithms
10	Design And Optimization Join Algorithms Based On Map Reduce