
Research And Application Of Key Technology Of Big Data Stream Computing Based On Storm

Posted on: 2018-12-29
Degree: Master
Type: Thesis
Country: China
Candidate: H Chen
Full Text: PDF
GTID: 2348330518973583
Subject: Software engineering
Abstract/Summary:
Large-scale real-time stream computing, an important part of big data computing, has been widely used in real-time statistics, real-time recommendation, real-time monitoring, personalized services and similar applications. Big data stream computing differs significantly from traditional big data batch processing in its data processing requirements and methods, so more and more scholars have begun to study it. Apache Storm is one of the best-known and most representative stream computing engines. Based on Storm, and following widely accepted systems engineering practice, typical big data stream computing is decomposed into three successive stages: data acquisition, data calculation and data storage. Together these three stages form a big data stream processing chain. This thesis studies the following problems, which arise in big data stream computing in current production environments.

First, data acquisition faces the following problems and difficulties: (1) in traditional data acquisition, the acquisition system is tightly coupled to the data source, and access to the source may be uncontrollable; (2) each data source maintains its own acquisition pipeline independently, which produces many acquisition interfaces with inconsistent formats and makes those interfaces hard to extend and maintain; (3) it is difficult to integrate multi-source, heterogeneous data sources in a distributed environment.

Second, data calculation based on Storm has the following problems: (1) stream computing requires the system to cope with sudden changes in the data stream rate, but Storm lacks the ability to adapt dynamically to such an unpredictable stream; (2) when the system is overloaded, Storm's latency increases significantly and the system becomes unstable; (3) Storm must pause the system to redistribute topology resources, which may lead to longer data calculation delays and data loss. In addition, the number of tasks in a calculation topology cannot be modified once it has been allocated, so resource allocation is also limited by the Topology settings.

Finally, data storage has the following difficulties: (1) when the connection pool provided by an open source database such as HBase is used to absorb highly concurrent dumps of Storm real-time stream data, its write performance cannot meet the requirements; (2) I/O blocking or errors in the persistence layer may cause dramatic fluctuations in write performance.

To address these problems, and against the background of a cloud manufacturing service platform technology project that our laboratory carries out for an elevator industry alliance, this thesis aims to build a real-time monitoring system for the cloud manufacturing service platform and to optimize big data stream computing. Starting from the characteristics of big data stream computing and following the three stages of the processing chain (data acquisition, data calculation and data storage), the thesis studies the key technologies of big data stream computing based on Storm, solves the difficulties and problems listed above, and implements a real-time monitoring system that provides real-time monitoring services for the cloud manufacturing platform. Compared with similar work, this thesis makes the following contributions:

(1) For data acquisition, the thesis proposes a hierarchical acquisition strategy for streaming data. The strategy integrates multi-source, heterogeneous data sources in a distributed environment, makes the acquisition process fault tolerant, and solves both the tight coupling between the acquisition system and the data source and the uncontrollable access problem. On the basis of this strategy, the load balancing of the data acquisition tool Flume is optimized, and a two-layer hash load balancing method is proposed that reduces data mobility across distributed nodes and keeps the overall system load as fair as possible. The experimental results show that the load balancing method improves throughput by 32% and reduces data mobility to 2%.

(2) For data calculation, the thesis designs and implements three optimization methods for the problems Storm exhibits when the system is overloaded: the significant increase in data calculation delay, system instability, and the lack of dynamic adaptation to big data streams. The three methods are a dynamic step-by-step backpressure strategy, a user-imperceptible topology replacement mechanism, and a parallel data backflow method. The experimental results show that, compared with Storm's default implementation, the optimized engine has the following advantages: (i) it effectively improves system throughput, by 10% to 25% in the best case, and roughly matches Storm's default implementation in the worst case; (ii) it effectively reduces data calculation delay, improving processing time by 25% in the best case, and suppresses system load oscillation; (iii) when resources are adjusted dynamically, the user does not notice the adjustment, the system does not need to be paused, and resource allocation is not limited by settings made before the system started running.

(3) For data storage, the thesis presents a storage optimization method for delayed persistence of streaming data, comprising a delayed persistent storage mechanism and a batch submission method. The experimental results show that delayed persistent storage defers disk I/O reads and writes relative to data processing, improves the write performance of the persistence layer, and shields the stream computing engine from the influence of the persistence layer. With the batch submission optimization for the write thread pool, the number of write operations can be increased by 36%.

(4) On the basis of the above work, the thesis designs and implements a real-time monitoring system for the cloud manufacturing platform, provides real-time monitoring services for the platform, and verifies the practicability and validity of the research.
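The batch submission idea in contribution (3) can be illustrated with a minimal sketch. This is not the thesis's implementation: the class name `BatchingWriter`, the `flush_fn` callback and the `batch_size` parameter are all hypothetical, and a plain Python list stands in for the persistence layer. The sketch only shows the general technique of buffering records in memory and submitting them in groups, so that the store receives one bulk write per batch instead of one write per record.

```python
import threading
from typing import Any, Callable, List


class BatchingWriter:
    """Hypothetical helper: buffers records and submits them in batches."""

    def __init__(self, flush_fn: Callable[[List[Any]], None], batch_size: int = 100):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush_fn = flush_fn      # assumed bulk-write callback to the store
        self._batch_size = batch_size  # records per submitted batch
        self._buffer: List[Any] = []
        self._lock = threading.Lock()  # write() may be called from many threads

    def write(self, record: Any) -> None:
        """Buffer one record; submit a batch once the buffer is full."""
        with self._lock:
            self._buffer.append(record)
            if len(self._buffer) < self._batch_size:
                return
            # Swap the full buffer out while holding the lock.
            batch, self._buffer = self._buffer, []
        # Submit outside the lock so slow I/O does not block other writers.
        self._flush_fn(batch)

    def flush(self) -> None:
        """Submit whatever is still buffered, e.g. at shutdown."""
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._flush_fn(batch)


# Usage: a list collects the submitted batches in place of a real store.
submitted: List[List[int]] = []
writer = BatchingWriter(submitted.append, batch_size=3)
for i in range(7):
    writer.write(i)
writer.flush()
print(submitted)
```

Seven records with a batch size of three produce three submissions rather than seven. A production version would also need a time-based flush so that records do not wait indefinitely when the stream rate is low; that part is omitted here.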
Keywords/Search Tags: Apache Storm, performance optimization, stream computing, data calculation, data acquisition, data storage