
A Distributed Computing Framework to Manage, Query, and Analyze Big Geospatial Data for Urban Studies - Case Studies with Urban Heat Island and Tourist Movement Pattern Mining

Posted on: 2019-06-07
Degree: Ph.D
Type: Dissertation
University: George Mason University
Candidate: Hu, Fei
Full Text: PDF
GTID: 1448390002971022
Subject: Geographic information science and geodesy
Abstract/Summary:
The urban system, a sub-component of the Earth system, is complex and dynamic, composed of numerous interactions among natural, human-built, and social entities (Jing et al., 2011). Although urban areas occupy only 3% - 4% of the land surface, their dynamic change and expansion are impacting our urban environments from local to global scales at unprecedented levels, including carbon cycling through land cover change, surface energy fluxes, and the urban heat island (Karen et al., 2014). Various geospatial data sources, including satellites, sensors, numerical models, mobile phones, and social networks, provide data to record and simulate how urban systems work (Yang et al., 2017a).

Ideally, these big geospatial data would be provided to scientists with on-demand processing and analytical capabilities to help them understand complex urban systems. However, with their varied sources and complex content, these data pose three grand challenges: 1) Volume: the size of geospatial data has far exceeded the capacity of any standalone data storage system. Such big data require a scalable storage architecture, but existing distributed data containers are not fully ready to handle them for several reasons, including disk- and compute-intensive data preprocessing, complex system maintenance procedures, and inefficient spatiotemporal data queries (Hu et al., 2018b); 2) Variety: data sources are usually stored in different data formats with a variety of dimensions, attributes, resolutions, and contents. The complex nature of geospatial data increases the difficulty of preprocessing and retrieving the datasets for urban studies; 3) Value: discovering the underlying mechanisms of urban systems requires both high-performance computing resources and advanced data analytic technologies (e.g., measuring land system architecture, simulating land use and land cover change, and detecting human dynamical patterns) to reveal the interactions among the urban system's components. In summary, easy-to-use yet powerful frameworks are urgently needed so that urban studies can combine efficient big data management with advanced data analytics in a scalable environment.

To address these challenges, this dissertation proposes SparkCity, a distributed computing framework based on Apache Spark and the Hadoop Distributed File System (HDFS), to integrate advanced data management technologies and data analytical methods. I design a scalable computing framework from the aspects of distributed geospatial data storage, hierarchical indexing for fast query, and parallel data analytics, taking into account the special features of geospatial data content. The framework aims to fill the gap between general-purpose distributed computing (Spark + HDFS) and big geospatial data processing. The capabilities of the prototype system are demonstrated in two case studies: 1) detecting the impact of land system architecture and socioeconomic factors on the urban heat island; 2) mining tourist movement patterns from social media data. Additionally, the proposed methodologies for handling big geospatial data can be applied to other geoscience studies that involve such data.
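
To make the idea of hierarchical indexing for fast spatiotemporal query more concrete, the following is a minimal single-machine sketch in Python. It is not SparkCity's actual design, which the abstract does not describe. The class name `SpatioTemporalIndex`, the day-level time buckets, and the fixed-degree grid cells are all assumptions chosen for illustration. The index narrows a query first by time bucket and then by grid cell before testing individual records.

```python
from collections import defaultdict
from datetime import datetime
from math import floor


class SpatioTemporalIndex:
    """Hypothetical two-level index: day bucket -> grid cell -> records."""

    def __init__(self, cell_deg=1.0):
        # cell_deg is an assumed grid resolution in degrees.
        self.cell_deg = cell_deg
        self.buckets = defaultdict(lambda: defaultdict(list))

    def _cell(self, lon, lat):
        # Map a coordinate to the integer id of its grid cell.
        return (floor(lon / self.cell_deg), floor(lat / self.cell_deg))

    def insert(self, lon, lat, when, payload):
        # Level 1: calendar day. Level 2: grid cell.
        cell = self._cell(lon, lat)
        self.buckets[when.date()][cell].append((lon, lat, when, payload))

    def query(self, min_lon, min_lat, max_lon, max_lat, start, end):
        # Visit only days and cells that overlap the query window,
        # then apply the exact bounding-box and time filter per record.
        x0, y0 = self._cell(min_lon, min_lat)
        x1, y1 = self._cell(max_lon, max_lat)
        hits = []
        for day, cells in self.buckets.items():
            if not (start.date() <= day <= end.date()):
                continue
            for x in range(x0, x1 + 1):
                for y in range(y0, y1 + 1):
                    for lon, lat, when, payload in cells.get((x, y), ()):
                        inside = (min_lon <= lon <= max_lon
                                  and min_lat <= lat <= max_lat)
                        if inside and start <= when <= end:
                            hits.append(payload)
        return hits


if __name__ == "__main__":
    index = SpatioTemporalIndex(cell_deg=0.5)
    index.insert(-77.3, 38.8, datetime(2017, 7, 1, 12), "point A")
    index.insert(-122.4, 37.8, datetime(2017, 7, 1, 12), "point B")
    print(index.query(-77.5, 38.5, -76.9, 39.0,
                      datetime(2017, 7, 1), datetime(2017, 7, 2)))
```

In a distributed setting, the outer levels of such an index could plausibly decide which partitions or files are read at all, so that only a fraction of the stored data is scanned. That mapping is likewise an assumption and is not stated in the abstract.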
Keywords/Search Tags: Data, Urban, Studies, Distributed computing framework, System, Complex, Query