Font Size: a A A

Efficient Similarity Join Over Probabilistic Data Streams Based On Earth Mover's Distance

Posted on:2018-11-23Degree:MasterType:Thesis
Country:ChinaCandidate:J Z ZhangFull Text:PDF
GTID:2348330542983651Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
In recent years,with the rise of emerging products such as video shring and social networking,the size of network data are following explosion.These data has the characteristics of complex structure and a large number.So the key data extracted from huge amounts of data is becoming more more difficult,especially taking similarity join in massive data.Similarity join,refers to the lookup from one or two data sources for all similar data,and returns the result.Similarity joins on probabilistic data play a vital role in many practical applications,such as sensor reading monitoring,stock analysis and object tracking based on multiple video sources.Earth Mover's Distance(EMD)proposed in Computer Vision is more effective in returning similar probabilistic data being more consistent to human's perception to similarity.However,the cubic time complexity of EMD hampers its wide application,especially in the analysis of fast incoming data streams.In this paper we,to the best of our knowledge,make the first attempt to address the EMD similarity join over data streams under sliding window semantics.The research work is mainly carried out the following aspects.For the EMD distance function optimization existing problems in the high complexity and long computing time,unlimited of data streams,put forward a kind of based on B + forest index framework of the EMD similarity algorithm which called the EMD-DSJoin.Design thought of the algorithm is using the theory of linear programming of the original dual convert reach probability date of the histogram to the EMD distance of lower bounds,then based on the EMD lower bounds distance to build a set of B+ forest index.Pruning for probability data of histogram use of B+ forest effectively for the don't need to make EMD computing,so as to speed up the efficiency of the EMD connection based on the similarity.Finally using sliding window to solve the problems of finite caches store delay data.Algorithm for concrete implementation method are as follows.First by building B+ tree forest and updating feasible solution to improve filtering effection,filter out completely relevant or completely irrelevant data,then by building sub-indexes,the data is discarded using the discarded sub-indexes,reducing the maintenance cost of the discarded data.Second is optimization of B+tree forest archive cycle and according to the variation of sliding window value and capacity factor,the value of the archive period P is achieved by the adaptive change,so that the B + tree forest index mechanism is more efficient.Validation carried out by using real world data sets the experiment and analysis results show that the EMD-DSJoin algorithm in CPU time,the number of EMD refinement has decreased to a certain extent,processing speed is faster than existing algorithms about 35%,which illustrated the EMD-DSJoin algorithm make the data more efficient pruning,to handle out-of-order data provide more effective treatment strategies.Data stream data to reach is not uniform,when the data arrives in a certain time period,due to the limited system resources,high data flow erupted cause system overload and leading to link performance declined dramatically.In order to solve this problem,this paper puts forward the load reduction strategy based on EMD-DSJoin algorithm.The strategy to fully consider the correlation data stream data with time which in system overloads to filter out the data contained in the redundant data,effectively reduced the number of similarities to connect,at the same time,as far as possible to ensure the integrity of the similarity join results.Experimental results show that using the EMD-DSJoin algorithm of load reduction strategy can be reducing the CPU time and the number of EMD refinement according to different set of discard threshold and verifies the feasibility and effectiveness of strategy.This paper is the first time use the semantics based on sliding window of the EMD process data flow similarity connection technology,put forward a series of strategies to improve the ability of EMD-DSJoin on data stream processing order delay probability histogram data,better solve the high data stream erupted lead to system overload.Thesis research results for connecting the data on the data flow similarity to provide new ideas and technical means.
Keywords/Search Tags:Data stream, Sliding window, EMD, Similarity join, EMDDSJoin, Load shedding
PDF Full Text Request
Related items