Research On Optimization Of Streaming Data Preprocessing Mechanism In Cloud Environment

Posted on:2019-08-27

Degree:Master

Type:Thesis

Country:China

Candidate:W Liu

Full Text:PDF

GTID:2428330566999339

Subject:Software engineering

Abstract/Summary:

In cloud environments, streaming data, as a processing mode for big data, has become a research hotspot. Hadoop shows excellent performance in data storage, which has attracted the attention of well-known IT enterprises at home and abroad. However, in a cloud environment, storing streaming data in Hadoop suffers from reduced data reliability and degraded cluster performance, which calls for an improved persistence algorithm. Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant, real-time processing of streaming data. Because the state of the streaming data and the computing power of the cluster change dynamically, Spark Streaming cluster parameters need to change accordingly to keep data processing efficient. This thesis studies the optimization of streaming data preprocessing and covers the following three areas.

First, a streaming data persistence algorithm based on HDFS dynamic replica allocation is proposed to address the reduced data reliability and degraded cluster performance that arise when HDFS stores streaming data in a cloud environment. The algorithm perceives the location of cluster nodes in the cloud environment, defines the distance between cluster nodes through the concept of relative position, and uses the node distance index and the node performance index as the conditions for persisting streaming data, so that the reliability of persisted data and the data processing performance of the cluster remain stable. Simulation results show that when a physical server goes down, the algorithm ensures that no data is lost in the cloud environment. The algorithm is also compared with similar algorithms on data reliability and cluster performance; the experimental results show that it performs better on both in a cloud environment.

Second, a streaming data processing algorithm based on Spark dynamic block adjustment is proposed to address the low efficiency of cluster data processing. The algorithm analyzes the load information of the current data stream and looks up, in a historical database table, the task attributes that resemble the current load and achieved the highest data processing efficiency, then uses them as the current parameters. If the database holds too little data, the algorithm adjusts the parameters automatically through a DAAC (dynamic adaptive adjustment control) controller to achieve real-time processing of streaming data. Simulation analysis compares the real-time performance of this algorithm with that of the traditional streaming data processing algorithm; the results show that the algorithm achieves high real-time performance. Its CPU and memory utilization are also compared with those of similar algorithms; the experimental results show that, under the same load and convergence conditions, it uses less CPU and memory than similar algorithms.

Finally, a prototype system is designed for the two algorithms, which demonstrates their performance in storing and processing streaming data. The streaming data persistence algorithm based on HDFS dynamic replica allocation guarantees the reliability of data and the stability of cluster performance. The streaming data processing algorithm based on Spark dynamic block adjustment guarantees real-time data processing.
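
The history lookup in the second algorithm can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not name the load metric, the tuned parameter, or the similarity rule, so the record fields, the relative tolerance, and the minimum match count below are all assumptions made for illustration. The tuned parameter is assumed to be a block interval in milliseconds, in the spirit of Spark Streaming's `spark.streaming.blockInterval` setting. The DAAC controller is not modeled; the function returns `None` to signal that a caller would hand control to it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HistoryRecord:
    """One past run (hypothetical schema, not taken from the thesis)."""
    input_rate: float        # observed load, records per second
    block_interval_ms: int   # parameter value used during that run
    throughput: float        # records processed per second; higher is better


def select_block_interval(
    history: List[HistoryRecord],
    current_rate: float,
    tolerance: float = 0.1,   # assumed relative similarity threshold
    min_matches: int = 3,     # assumed meaning of "too little data"
) -> Optional[int]:
    """Return the parameter of the most efficient similar past run.

    A run counts as similar when its input rate lies within `tolerance`
    (relative) of `current_rate`. Returns None when fewer than
    `min_matches` similar runs exist, so the caller can fall back to an
    adaptive controller instead.
    """
    similar = [
        r for r in history
        if abs(r.input_rate - current_rate) <= tolerance * current_rate
    ]
    if len(similar) < min_matches:
        return None
    best = max(similar, key=lambda r: r.throughput)
    return best.block_interval_ms


# Example history table with made-up values.
HISTORY = [
    HistoryRecord(input_rate=950.0, block_interval_ms=200, throughput=900.0),
    HistoryRecord(input_rate=1050.0, block_interval_ms=100, throughput=1000.0),
    HistoryRecord(input_rate=1000.0, block_interval_ms=400, throughput=950.0),
    HistoryRecord(input_rate=2000.0, block_interval_ms=50, throughput=1900.0),
]
```

With this table, a current load of 1000 records per second matches three past runs, and the lookup returns the interval of the one with the highest throughput. A load of 5000 matches none, so the lookup returns `None` and the adaptive controller would take over.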

Keywords/Search Tags:

Cloud Computing, Streaming Data, Replica Allocation, Parameter Optimization

Related items

1	The Research On Data Replica Management Strategy In Cloud Computing
2	New Multiple-replica Storage And Provable Data Possession Schemes In Cloud Computing
3	Research On Replica Optimization Strategy In Data-intensive Computing
4	Research On Optimization Of Big Data Storage Replica Strategy In Cloud Environment
5	Research Of Replica Management Mechanism For Integration Of Cloud-P2P Computing
6	Research On Replica Optimization In Data-intensive Computing
7	Research On Data Placement And Replication Strategy In Cloud Computing
8	Research On Key Technologies Of Distributed Storage In Cloud Computing
9	Research On Dynamic Adaptive Streaming Over Cloud Computing
10	Research On Multi-replica Provable Data Possession Under Cloud Computing