Research On The Key Technologies Of Large-Scale Real-Time Data Streams Joins

Posted on:2016-10-18

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X C Liu

Full Text:PDF

GTID:1228330467490516

Subject:Computer software and theory

Abstract/Summary:

Recently, with the development of the Internet, sensors, and the Internet of Things (IoT), data is being generated at an ever-increasing rate. Much valuable information that can make daily life more convenient is hidden in this data. In many applications, information is delivered to users as streams, which are highly time-sensitive and must be processed on the fly. Moreover, because of resource constraints and the characteristics of the applications, a single stream provides only part of the information, so users need to combine multiple streams to obtain complete information. As one effective means of combining information, the join plays an important role in data stream processing.

With the arrival of the era of big data, computing on a single machine can no longer meet the requirements of stream joins, and clusters of "shared-nothing" computers have become one effective alternative. Building on a thorough study and summary of related work, this dissertation focuses on multi-way stream joins in a distributed environment. It consists of the following parts.

First, we give an algorithm for building Compressed Histograms based on incremental computation under the stream model. Within the error we define, Compressed Histograms approximately reflect the data distribution of the current sliding windows and provide the information needed for the join optimizations in the later parts.

Second, we propose the Pipeline Join algorithm, based on the characteristics of two-way stream joins. The computing nodes are organized linearly, and the two streams flow into the pipeline from opposite directions. The algorithm handles both equi-joins and non-equi-joins and guarantees the completeness of results. We also propose a fault tolerance mechanism based on upstream backup, a load balancing mechanism that works like pressing plasticine, and a scalability policy based on the pipeline model.

Third, we propose a distribution policy based on consistent hashing for multi-way stream equi-joins. The policy ensures that related tuples from different streams are routed to the same node, and it also keeps the load balanced across all computing nodes. In addition, using the distribution information provided by the histograms, we give a greedy algorithm for building join trees, which produces a relatively optimized join order.

Finally, we study the more general multi-way stream θ-joins and propose distribution policies based on range hashing and on sharing time slices. Both policies take the completeness of results and load balance into account, and they reduce the number of backups and therefore the network transmission. We also propose a join policy named "Group Join", based on the (key, valueList) form, which can reduce the running time in some cases.

In summary, the dissertation focuses on multi-way stream joins in a distributed environment and proposes the Pipeline algorithm for two-way stream joins, the Consistent Hashing algorithm for multi-way stream equi-joins, and the Range Hash and Sharing Time Slice policies for multi-way stream θ-joins. It also gives a series of load balancing, fault tolerance, scalability, and optimization policies.
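
The routing idea behind the consistent hashing policy for equi-joins can be illustrated with a small sketch. This is not the dissertation's implementation: the class name `ConsistentHashRing`, the node names, the use of MD5, and the choice of 64 virtual points per node are all assumptions made for illustration. The sketch only shows why tuples that share a join key land on the same node regardless of which stream they come from, and why adding a node moves only the keys that the new node takes over.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable position on the hash ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Hypothetical ring; each node owns several virtual points."""

    def __init__(self, nodes, replicas=64):
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            for i in range(replicas):
                point = _hash(f"{node}#{i}")
                self._owner[point] = node
                bisect.insort(self._points, point)

    def node_for(self, join_key) -> str:
        """Return the node owning the first ring point after the key."""
        point = _hash(str(join_key))
        index = bisect.bisect_right(self._points, point) % len(self._points)
        return self._owner[self._points[index]]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])

# Tuples are (stream name, join key, payload); routing looks only at the key,
# so matching tuples from streams R, S and T meet on one node.
tuples = [
    ("R", 42, "r1"), ("S", 42, "s1"), ("T", 42, "t1"),
    ("R", 7, "r2"), ("S", 7, "s2"),
]
placement = {}
for stream, key, payload in tuples:
    placement.setdefault(ring.node_for(key), []).append((stream, key, payload))

for node, assigned in sorted(placement.items()):
    print(node, assigned)
```

Using several virtual points per node is a common way to even out the share of the key space each node receives; whether the dissertation balances load this way is not stated in the abstract.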

Keywords/Search Tags:

real-time data stream, join, compressed histograms, consistent hashing, (key, valueList), sharing time slice

Related items

1	The Distinct Element Problem In Equi-join For Multiple Data Streams
2	Quality Monitoring And Analysis Of Steel Products Based On Real-time Data Stream
3	Research On Real-time Transmission And Management Technology Of UAV Freight Big Data
4	Study On Correlation Analysis And Mining In Real-Time Data Streams
5	Research And Implementation Of Data Stream Processing Technology For Environment Monitoring
6	Research On Application System Based On Real-Time Data Flow
7	Real-time Image Sharing System For Smart Phones
8	Real-Time Interpretation And Optimization Of Stream Time Series In Big Data
9	Research On The Real-Time Identification Method For Data Stream Events With Data Drift Feature
10	A Real-Time Stream System Based On Batch-processing Schema