
Optimizing Big Data Equi-join In Spark And Its Application In Analysis Of Network Traffic Data

Posted on: 2016-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: S W Zhou
GTID: 2308330479493949
Subject: Computer application technology
Abstract/Summary:
With the development of Internet technology and cloud computing, data volumes are growing in every sector, and the era of big data has arrived. In the field of telecommunications, the network traffic data used to monitor the Internet is also growing rapidly, and storing and analyzing it requires big data technology. Hadoop has been introduced for this purpose, and HDFS and MapReduce, two important components of the Hadoop ecosystem, meet the demands of network traffic data to a certain extent. However, MapReduce is a batch processing model and cannot meet other analysis needs such as interactive queries and streaming data processing. As a result, several different distributed computing frameworks, such as MapReduce, Impala and Storm, have been used at the same time. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. The analysis of network traffic big data involves many join operations on big tables, and the join operation in Spark performs poorly: many records that do not meet the join condition still pass through the shuffle stage, which incurs heavy network communication and I/O overhead. First, for big-table join tasks, this thesis presents a new Spark join method based on the Bloom filter, which filters out many records that do not meet the join condition in advance. Second, it proposes a new network traffic big data platform based on the improved Spark and describes how Spark and the components of the Spark ecosystem are used to process network traffic big data. Finally, the work is verified by two experiments: the new join method reduces network communication overhead and shortens the shuffle phase, and Spark shows better performance than MapReduce in processing network traffic big data.
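The Bloom-filter pre-filtering idea described in the abstract can be sketched in a few lines. This is a minimal single-process illustration in plain Python, not the thesis's Spark implementation: the class `BloomFilter`, the function `bloom_join`, the filter size, the hash scheme, and the choice of which table builds the filter are all assumptions made for the example. It builds a filter over the join keys of one table, drops records of the other table whose keys cannot match, and then joins the survivors. In a distributed setting, the dropped records are the ones that would no longer need to be shuffled.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive two 64-bit hashes from one digest, then combine them.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def bloom_join(left, right):
    """Equi-join two lists of (key, value) pairs, pre-filtering `right`.

    Returns the joined rows and the number of `right` records that
    survived the filter (a stand-in for the shuffled record count).
    """
    bloom = BloomFilter()
    for key, _ in left:
        bloom.add(key)

    # Records dropped here would never reach the shuffle in a distributed join.
    candidates = [(k, v) for k, v in right if bloom.might_contain(k)]

    # Plain hash join on the survivors; this also discards false positives.
    index = defaultdict(list)
    for k, v in left:
        index[k].append(v)
    joined = [(k, lv, rv) for k, rv in candidates for lv in index.get(k, [])]
    return joined, len(candidates)


if __name__ == "__main__":
    # Hypothetical data: a few flow records joined against many host records.
    flows = [("10.0.0.1", 120), ("10.0.0.2", 80)]
    hosts = [("10.0.0.%d" % i, "host%d" % i) for i in range(1000)]
    rows, kept = bloom_join(flows, hosts)
    print(rows)
    print("records kept after filtering:", kept, "of", len(hosts))
```

A Bloom filter never reports a present key as absent, so the pre-filter cannot lose join results; false positives only let a few extra records through, and the exact join afterwards removes them.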
Keywords/Search Tags: Spark, HDFS, BloomFilter, Join, NetFlow