
Research On Key Technologies For Batch-stream Computing Platform

Posted on: 2020-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: N Yun
Full Text: PDF
GTID: 2428330605967975
Subject: Computer technology
Abstract/Summary:
With the rapid development of the mobile Internet and cloud computing, the scale of data has grown rapidly, and application scenarios are becoming more and more complex. Internet enterprises such as Microsoft and Yahoo!, together with the open source community Apache, have developed many tools for processing big data, ranging from batch computing to stream computing and real-time interactive computing. Different frameworks suit different application scenarios. Scenarios with a large amount of data and low real-time requirements are suited to batch processing systems. Stream processing systems respond quickly, but they cannot recompute results and place high demands on hardware. Batch and stream systems usually require two separate clusters to be deployed, which increases deployment and operational costs. A single computing framework can no longer cope with complex data, so the industry urgently needs a framework that can handle both batch and stream data and reduce the cost of data management and operations.

At present, some frameworks can handle both batch and stream data, such as Flink, Alibaba's Blink, and Tencent's Oceanus, which are designed to integrate various systems and provide different computing services. However, the integration of batch and stream platforms still faces a series of challenges: how to manage data and permissions in a unified way, how to unify expression and query, how to improve query performance, and how to ensure the accuracy of the results.

In view of these problems, this thesis focuses on the following two aspects. On the one hand, in order to unify the expression and query of batch and stream data, dynamic tables and revocation (retraction) operations are introduced, and a scheme for converting any standard SQL into a real-time computing program is formed. The scheme ensures the correctness of queries and makes it possible to query both batch and stream data with a single set of SQL. At the same time, in order to manage and maintain metadata and permissions uniformly across different systems, Meta Service, a metadata management system, is proposed, and unified permission management is realized on top of it, so that data generated in any data system can be used seamlessly in the other systems.

On the other hand, in order to analyze and optimize the query performance of Catalyst, the optimizer of Spark, the optimization strategies for SQL are studied from two aspects: rule-based optimization (RBO) and cost-based optimization (CBO). Some optimization rules, such as Combine Filters, are selected to evaluate the performance of Catalyst, and the optimization effects of RBO and CBO are studied separately while varying the scale of the data and of the cluster. Finally, optimization suggestions are given for parameters including data processing parallelism, the Driver and Executor parameters, and broadcast optimization, and experiments verify the effectiveness of these suggestions.

The key technologies of batch-stream platforms can be used to unify the expression and query of batch and stream data, improve processing capability, and reduce the cost of cluster construction, while metadata and permissions are managed and maintained efficiently in a unified way. The analysis of Catalyst's SQL optimization can provide inspiration and reference for community developers seeking to improve the performance of the optimizer.
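The role that dynamic tables and revocation operations play in making one SQL query valid over both batch and stream data can be illustrated with a small sketch. It is not the thesis's implementation: the function names (`count_by_key`, `materialize`, `batch_count`) and the changelog encoding are hypothetical, and the query is assumed to be a simple `SELECT key, COUNT(*) ... GROUP BY key`. The streaming side emits a revocation (`"-"`) for each previously emitted row before emitting the updated row (`"+"`), so that replaying the changelog yields the same table a batch run would produce.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical changelog entry: (op, key, count).
# op is "+" for an inserted row and "-" for a revoked (retracted) row.
Change = Tuple[str, str, int]


def count_by_key(events: Iterable[str]) -> List[Change]:
    """Evaluate a GROUP BY count continuously and return its changelog."""
    counts: Dict[str, int] = defaultdict(int)
    changelog: List[Change] = []
    for key in events:
        old = counts[key]
        if old > 0:
            # The row emitted earlier for this key is now stale: revoke it.
            changelog.append(("-", key, old))
        counts[key] = old + 1
        changelog.append(("+", key, old + 1))
    return changelog


def materialize(changelog: Iterable[Change]) -> Dict[str, int]:
    """Replay a changelog into the current state of the dynamic table."""
    table: Dict[str, int] = {}
    for op, key, count in changelog:
        if op == "-":
            # Remove the row only if it matches what was emitted before.
            if table.get(key) == count:
                del table[key]
        else:
            table[key] = count
    return table


def batch_count(events: Iterable[str]) -> Dict[str, int]:
    """Reference batch result for the same query over the full input."""
    result: Dict[str, int] = {}
    for key in events:
        result[key] = result.get(key, 0) + 1
    return result


if __name__ == "__main__":
    events = ["a", "b", "a", "a"]
    log = count_by_key(events)
    print(log)
    print(materialize(log) == batch_count(events))
```

In this sketch the streaming result converges to the batch result after every event, which is the property that allows the same query text to be used for both kinds of input.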
Keywords/Search Tags: batch-stream integration, metadata management, permission management, query optimization