Research On Data Dispatch Technology In Distributed Stream Join Systems

Posted on: 2024-09-21

Degree: Doctor

Type: Dissertation

Country: China

Candidate: S Y Yu

GTID: 1528307319462514

Subject: Computer system architecture

Abstract/Summary:

With the rapid development of big data and the Internet of Things, data is being generated at an explosive rate, and the demand for timely data processing keeps growing. Distributed stream processing systems have emerged to meet the challenge of time-sensitive computation over large-scale stream data. Among their operations, stream join is widely used to extract correlations between multi-source stream data, and as a basic operation in stream processing systems it has become a research focus in the field. In a distributed stream join system, the data dispatch strategy is the key to ensuring low processing latency, high throughput, and scalability. However, the skewed and multi-source characteristics of stream data cause serious problems for distributed stream join systems: high processing latency, low throughput, and poor scalability.

First, the skewed range distribution of stream data leads to load imbalance. Because the tuples in a certain range account for a large proportion of all tuples, a system that adopts a fixed range dispatch strategy overloads the processing unit that receives that range; the unit cannot process stream data in real time, resulting in high processing latency. Second, the skewed key distribution of stream data leads to load imbalance. Because tuples with a small number of high-frequency keys account for a large proportion of all tuples, and hash dispatch sends tuples of the same key to the same processing unit, the processing units that receive high-frequency keys are overloaded while those that receive only low-frequency keys sit idle, resulting in low throughput. Finally, the complex multi-source join process leads to poor scalability. Tuples from the multiple streams to be joined are stored in a distributed manner, so to guarantee the completeness of the join result the system must transmit and join intermediate results among multiple processing units to generate the final result. The number of transmission and join steps grows with the number of streams, which complicates the generation of the final result and results in high processing latency and poor scalability.

To solve these problems, this dissertation studies data dispatch technology in distributed stream join systems. The specific contributions are as follows.

An adaptive dispatch strategy for stream join under skewed distribution is proposed, which addresses the load imbalance caused by the skewed range distribution of stream data. Data is first dispatched by fixed range to the range partitions of the stream join system, where a partition is a unit that stores one range of data; tests on real-world datasets verify that data skew leads to load imbalance and high processing latency. To address this, a migration benefit model based on benefit integration is designed. It expresses the change in the number of partitions and the change in load imbalance caused by a migration as a change in queuing-theory stay time, and uses that change to measure the benefit of the migration. The most beneficial migration scheme is the one with low migration cost and a controllable number of partitions. In addition, to reduce the cost of querying adjacent partitions during migration, a partition connection graph for processing unit queries is proposed. Experimental results on large-scale real data show that, compared with existing designs, the system based on the adaptive dispatch strategy reduces average processing latency by 56% and increases throughput by 62%.

A load-aware dynamic dispatch strategy is proposed to address the load imbalance caused by the skewed key distribution. Existing dynamic dispatch strategies dispatch the tuples of a migrated key to a random processing unit and broadcast the tuples of that key in the other stream to all processing units, which increases network and join computation cost. Therefore, a load-aware detection and selection method is proposed, in which only some processing units are selected as the immigration processing units of an emigrated key. The method considers an increasing number of immigration processing units and evaluates the impact of the emigrated key on system performance through a performance evaluation model based on queuing theory. In addition, the dynamic dispatch strategy needs an approximate representation structure of the data stream set to serve as the route for migrated keys in the distributor. However, existing approximate representation structures cause a large number of discrete memory accesses during key queries. Experimental results on large-scale real data show that, compared with existing designs, the system based on the load-aware dynamic dispatch strategy reduces average processing latency by 31% and increases system throughput by 13%.

A symmetric dispatch strategy for scalable multi-source stream join is proposed to address the poor scalability caused by the complex multi-source join process. By analyzing the importance of intermediate results in the multi-way join execution tree and counting the repeated occurrences of tuple keys, two problems are identified: the generation of final results depends on the transmission of intermediate results, and tuples in intermediate results are stored and sent repeatedly. To reduce processing latency, a symmetric dispatch strategy is designed to decouple the final result from its dependence on the intermediate result. Under this strategy, each tuple is joined with the intermediate result to generate the final result immediately after it arrives in the system. Furthermore, an intermediate data dispatch structure based on a dynamic attribute index graph is proposed to reduce the number of intermediate results and the computation cost. Experimental results on large-scale real data show that, compared with existing designs, the multi-source stream join system based on the symmetric dispatch strategy reduces processing latency by 29% while increasing system throughput by 12%.
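
The key-skew problem described in the abstract can be illustrated with a small sketch. This is not the dissertation's method: the Zipf-like key frequencies, the 8 processing units, the choice of hot keys, and the helper names (`zipf_stream`, `hash_dispatch`, `split_hot_keys`, `imbalance`) are all assumptions made for illustration. It compares plain hash dispatch with a naive scheme that spreads each hot key over a few units in round-robin order, without any load awareness.

```python
from collections import Counter
from itertools import cycle


def zipf_stream(num_keys, num_tuples, s=1.2):
    """Deterministic Zipf-like stream: key k appears in proportion to 1/(k+1)^s."""
    weights = [1.0 / (k + 1) ** s for k in range(num_keys)]
    total = sum(weights)
    stream = []
    for key, w in enumerate(weights):
        stream.extend([key] * round(num_tuples * w / total))
    return stream


def hash_dispatch(stream, num_units):
    """Plain hash dispatch: every tuple of a key goes to the same unit."""
    load = Counter({u: 0 for u in range(num_units)})
    for key in stream:
        load[hash(key) % num_units] += 1
    return load


def split_hot_keys(stream, num_units, hot_keys, fanout):
    """Spread each hot key round-robin over `fanout` units; hash all other keys.

    In a real join, tuples of a split key from the other stream would have to
    be replicated to all `fanout` units, which is the extra cost the abstract
    mentions. That cost is not modeled here.
    """
    load = Counter({u: 0 for u in range(num_units)})
    targets = {
        k: cycle([(hash(k) + i) % num_units for i in range(fanout)])
        for k in hot_keys
    }
    for key in stream:
        unit = next(targets[key]) if key in targets else hash(key) % num_units
        load[unit] += 1
    return load


def imbalance(load):
    """Load of the busiest unit divided by the mean load (1.0 is perfectly balanced)."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean


stream = zipf_stream(num_keys=1000, num_tuples=100_000)
plain = imbalance(hash_dispatch(stream, num_units=8))
split = imbalance(split_hot_keys(stream, num_units=8, hot_keys={0, 1, 2}, fanout=4))
print(f"hash dispatch imbalance:  {plain:.2f}")
print(f"hot-key split imbalance:  {split:.2f}")
```

Under these assumptions the unit that owns the most frequent key carries far more than the mean load with plain hashing, and splitting the hot keys lowers the imbalance. A load-aware strategy such as the one the abstract describes would additionally choose which units, and how many, receive a migrated key.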

Keywords/Search Tags:

Stream processing, Stream join, Distributed system, Data dispatch, Data skew

Related items

1	Result Completeness Guarantee Strategy Studies In Distributed Stream Join Systems
2	Distributed Stream Join System Load Balance Strategy Studies
3	Research On Anti-skew Partitioning Method Based On High-frequency Key Perception In Stream Processing Systems
4	Research On Stream Join Algorithm And Parallelization Based On Big Data Platform
5	Study On Data Stream Techniques And Its Application In Electric Power Information Processing
6	Research On Distributed Data Stream Query Processing
7	Research On Load Management Technology In Distributed Data Stream Processing
8	Research And Implementation Of Data Stream Management System
9	Research On Key Technologies In Analysis Processing Of Network Security Monitoring Data Stream
10	The Distinct Element Problem In Equi-join For Multiple Data Streams