
A Hadoop-based storage system for big spatio-temporal data analytics

Posted on: 2013-05-01
Degree: Ph.D
Type: Thesis
University: Hong Kong University of Science and Technology (Hong Kong)
Candidate: Tan, Haoyu
Full Text: PDF
GTID: 2458390008988307
Subject: Computer Science
Abstract/Summary:
During the past decade, various GPS-equipped devices have generated a tremendous amount of data with time and location information, which we refer to as big spatio-temporal data. As this data continues to grow, it will outgrow the capabilities of any serial processing technique, so it is necessary to perform the analytics in parallel.

There are two main paradigms for large-scale data processing: the parallel relational database management system (RDBMS) and MapReduce. The debate over which paradigm is superior lasted for several years and led to a widely accepted view that each has advantages in different respects. It was once believed that an RDBMS delivers better performance while MapReduce scales out more easily thanks to its emphasis on fault tolerance. However, recent work from both sides demonstrates that techniques used in one paradigm can be incorporated into the other to fix its deficiencies. In our research, we use Hadoop [1], an open-source implementation of MapReduce and related components, to perform spatio-temporal data analytics. The main consideration is that Hadoop provides a low-level application programming interface (API) that is more flexible for implementing complex data mining algorithms than the structured query language (SQL) supported by an RDBMS.

In this thesis, we first describe the design and implementation of CloST, a scalable storage system for big spatio-temporal data that supports analytics using Hadoop. The main objective of CloST is to avoid scanning the whole dataset when a spatio-temporal range is given. To this end, we propose a novel data model that gives special treatment to three core attributes: an object id, a location, and a time. Based on this data model, CloST hierarchically partitions data by all three core attributes, which enables efficient parallel processing of spatio-temporal range scans (the first sketch after this abstract illustrates the idea). Exploiting the characteristics of the data, we devise a compact storage structure that reduces storage size by an order of magnitude. In addition, we propose scalable bulk-loading algorithms capable of incrementally adding new data to the system. We then address the problem of creating secondary indexes in CloST in parallel. In particular, we present the design of a general framework for parallel R-tree packing using Hadoop. The framework packs the R-tree one level at a time, from the bottom up (see the second sketch after this abstract). For the lower levels, which contain a large number of rectangles, we propose a partition-based algorithm for parallel packing. We also discuss two spatial partitioning methods that can efficiently handle heavily skewed datasets.

To evaluate performance, we conduct extensive experiments on large real datasets of up to 200 GB containing up to 2 billion spatial objects. The results show that CloST achieves fast data loading, high scalability in query processing, a high data compression ratio, and good performance when building very large secondary spatial indexes (R-trees).
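The abstract does not give CloST's actual partitioning scheme, only that data is partitioned hierarchically by object id, time, and location. The following is a minimal, hypothetical Java sketch of how such a three-level partition key might be derived; every name and constant here (PartitionKey, NUM_OBJECT_BUCKETS, TIME_BUCKET_MS, GRID_SIZE_DEG) is an illustrative assumption, not the thesis's API.

```java
// Hypothetical sketch: deriving a hierarchical partition key for a
// spatio-temporal record (object id, timestamp, location), so that a
// range scan reads only the buckets/time spans/grid cells it overlaps.
// Names and constants are illustrative, not taken from the CloST thesis.
public final class PartitionKey {
    static final int NUM_OBJECT_BUCKETS = 64;             // hash buckets on object id
    static final long TIME_BUCKET_MS = 24L * 3600 * 1000; // one day per time bucket
    static final double GRID_SIZE_DEG = 0.5;              // spatial grid cell size

    final int objectBucket;  // level 1: object id hash bucket
    final long timeBucket;   // level 2: coarse time interval
    final int cellX, cellY;  // level 3: spatial grid cell

    PartitionKey(String objectId, long timestampMs, double lon, double lat) {
        this.objectBucket = Math.floorMod(objectId.hashCode(), NUM_OBJECT_BUCKETS);
        this.timeBucket = Math.floorDiv(timestampMs, TIME_BUCKET_MS);
        this.cellX = (int) Math.floor((lon + 180.0) / GRID_SIZE_DEG);
        this.cellY = (int) Math.floor((lat + 90.0) / GRID_SIZE_DEG);
    }

    /** Directory-style path under which the record could be stored on HDFS. */
    String toPath() {
        return String.format("bucket=%d/day=%d/cell=%d_%d",
                objectBucket, timeBucket, cellX, cellY);
    }

    public static void main(String[] args) {
        // Example record: object "taxi-42" at (114.17 E, 22.30 N) on 2012-06-01.
        PartitionKey k = new PartitionKey("taxi-42", 1338508800000L, 114.17, 22.30);
        System.out.println(k.toPath());
        // A spatio-temporal range query then enumerates only the matching
        // bucket/day/cell paths instead of scanning the whole dataset.
    }
}
```

Fixing the hierarchy in advance is what makes the scans parallel-friendly: a Hadoop job can assign whole partitions to tasks rather than filtering every record.

For the parallel R-tree packing framework, the abstract states only that levels are packed sequentially from the bottom up. Below is a single-machine analogue of that level-by-level logic, using a simple sort-and-group strategy; the per-level sort order, node capacity, and class names are assumptions, and the Hadoop partitioning that the thesis actually contributes is not shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical single-machine analogue of bottom-up R-tree packing:
// sort the current level's entries, group every B of them into a node,
// take each node's MBR as an entry of the next level, and repeat until
// a single root remains. The thesis parallelizes each level with Hadoop.
final class Rect {
    double x1, y1, x2, y2; // minimum bounding rectangle (MBR)
    Rect(double x1, double y1, double x2, double y2) {
        this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
    }
    /** Tightest rectangle covering all input rectangles. */
    static Rect mbr(List<Rect> rs) {
        Rect m = new Rect(Double.MAX_VALUE, Double.MAX_VALUE,
                          -Double.MAX_VALUE, -Double.MAX_VALUE);
        for (Rect r : rs) {
            m.x1 = Math.min(m.x1, r.x1); m.y1 = Math.min(m.y1, r.y1);
            m.x2 = Math.max(m.x2, r.x2); m.y2 = Math.max(m.y2, r.y2);
        }
        return m;
    }
}

final class BottomUpPacker {
    static final int B = 4; // node capacity (tiny, for illustration)

    /** Packs one level: sorts entries by x-center, groups every B into a node MBR. */
    static List<Rect> packLevel(List<Rect> entries) {
        entries.sort(Comparator.comparingDouble(r -> (r.x1 + r.x2) / 2));
        List<Rect> next = new ArrayList<>();
        for (int i = 0; i < entries.size(); i += B)
            next.add(Rect.mbr(entries.subList(i, Math.min(i + B, entries.size()))));
        return next;
    }

    public static void main(String[] args) {
        List<Rect> level = new ArrayList<>();
        for (int i = 0; i < 20; i++) // 20 leaf-level rectangles
            level.add(new Rect(i, i % 5, i + 1, i % 5 + 1));
        while (level.size() > 1) {   // pack level by level, bottom up
            level = packLevel(level);
            System.out.println("packed level with " + level.size() + " nodes");
        }
    }
}
```

The level-at-a-time structure is what makes a MapReduce formulation natural: each level is an independent sort-and-group pass over the previous level's output, which is exactly the shape of a Hadoop job.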
Keywords/Search Tags: Data, Storage, Hadoop, System, Large, Processing