Font Size: a A A

Research On Managing And Querying Upon Big Trajectory Data

Posted on:2019-07-30Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z G ZhangFull Text:PDF
GTID:1368330563455382Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Recently,big trajectory data is generated in an explosive way with the popularity of smartphones and other location-acquisition devices.These devices,monitoring the motion of vehicles,people,animals and goods,are producing massive distributed trajec-tories rapidly.These data not only reflect the spatio-temporal mobility of individuals and groups,but also contain the behavior information of moving objects.They are invaluable for applications such as route planning,urban planning and logistics scheduling.To make full use of such data,tremendous efforts have been made to support effective trajectory data management analysis,including trajectory indexing,trajectory clustering,trajectory classification,anomaly detection and behavior prediction.In this dissertation,the problem of how to manage the big trajectory efficiently on a computer cluster is researched firstly.Existing trajectory management systems fail to provide scalable and low latency query services over such data due to the high I/O cost.In view of that,Traj Spark,a distributed in-memory system to consistently offer efficient management of trajectory data is presented.Traj Spark introduces a new abstraction called IndexTRDD to manage trajectories as a set of segments,and exploits a global and local indexing mechanism to accelerate trajectory queries.Furthermore,to alleviate the essen-tial partitioning overhead as the increase of data volume,it adopts the time-decay model to monitor the change of data distribution and updates the data-partition structure adaptively.This model avoids repartitioning existing data when new batch of data arrives.Next,we focus on the k nearest neighbor query over the distributed trajectory data.Although this query can be solved by directly sending the query reference to all remote sites,in which the pairwise distances are computed precisely.However,the overall com-munication and computation cost is huge.Besides,there are a number of trajectory dis-tance measures and existing work only focus on one of them,thus these work are short of generality.To address above challenges,Some communication-saving ways is devised to estimate pairwise distance by using sketch data,which allow filtering some trajectories in advance without precise computation.Besides,two general frameworks is proposed to deal with different distance measures.The first one prunes candidates by using both dis-tance lower and lower bounds,while the second one only uses distance lower bound.So,the former one suites for cases where both distance lower and upper bounds can be com-puted by using sketch data,and the latter one suites for cases that only lower bound can be computed.Finally,Euclidean distance is embedded in the first framework and DTW distance is embedded in the second one,to evaluate the practicality of these frameworks.The research questions and technical contributions in this thesis can be summarized as follows:1.TrajSpark system is proposed to support low-latency query over big trajectory data.Existing trajectory systems are built on top of disk-based systems and suffer the problem of I/O overhead.So,Traj Spark which utilizes distributed memory toTraj Spark builds IndexTRDD,an RDD of trajectory segments,to support efficient data storage and management by incorporating a global and local indexing strat-egy.Besides,Traj Spark monitors the change of data distribution by importing a time decay model which alleviates the repartitioning overhead occurred in existing distributed systems and gets a good partition result at the same time.Extensive ex-periment results demonstrate the superiority of Traj Spark over other Spark-based systems.2.Pruning strategy FTB is proposed by using distance lower and upper bounds to solve the knn query,and algorithm ED-FTB is designed to embed Euclidean distance onto FTB.Trajectory compression is a practical way to reduce the commu-nication cost.Using distance bound(s),which are computational cheaper than orig-inal distance,to prune candidates is an efficient way to improve query efficiency.We combine these two methods and design the pruning strategy FTB by computing lower and upper bound from sketch data which is computed from proper compres-si on method.To evaluate the efficiency of FTB,ED-FTB algorithm which embeds Euclidean distance in FTB strategy is proposed.ED-FTB adopts Haar wavelet co-efficients as multi-resolution sketch data.Both new lower and upper bounds are designed for ED-FTB,and it is proved that both the bounds get tighter when finer-grained sketch data is used.Theoretical analysis and extensive experimental results show that ED-FTB outperforms the state-of-the-art algorithm.3.Pruning strategy FLB is proposed by using distance lower bound to solve the knn query,and algorithm DTW-FTB is designed to embed DTW distance onto FLB.For the case when upper bound cannot be computed from sketch data,the pruning strategy FLB which uses only distance lower bound is proposed.FLB com-putes a global pruning threshold to approximate the k-th smallest distance.To eval-uate the efficiency of FLB,DTW-FLB algorithm which embeds DTW distance in FLB strategy is proposed.In DTW-FLB,bounding envelopes is adopted as multi-resolution sketch data and design a lower bound for DTW distance by using such data.The lower bound gets tighter when finer-grained bounding envelope is used.Early-abandoning pruning policy is adopted in DTW-FLB to improve the query ef-ficiency.Extensive experimental results show that DTW-FLB algorithm is efficient and scalable.In summary,this dissertation researches the topics of trajectory data storage manage-ment and query processing.To store the big trajectory in a cluster under a query-efficient way,a distributed in-memory based system is proposed.In this system,a memory-suit trajectory data structure is designed,and a global-local indexing strategy is proposed.To ensure the scalability of the system,time-decaying model is adopted to self-adjust to the change of data distribution as more data are loaded.Efficient algorithms by us-ing indexes are designed to process typical trajectory queries.In addition,for the case when big trajectories are stored in spatially-distributed nodes,how to solve the knn query in a communication-saving way is researched.The pruning strategy which uses multi-resolution sketch data to compute upper and lower bounds is devised.Next,for the dis-tance who cannot supply a upper bound,pruning strategy which only uses lower bound and a global threshold is also devised.Euclidean and DTW distance are embedded to those two strategies separately to evaluate their correctness.Theoretical analysis and extensive experimental results show the efficiency and scalability of the proposed algorithms.
Keywords/Search Tags:Trajectory, Data Management, knn query, Distance Bounds, Communication Cost, Time Efficiency
PDF Full Text Request
Related items