RSP: A New Approach for Approximate Big Data Analysis

Posted on: 2020-12-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Salman Salloum
Full Text: PDF
GTID: 1368330599954822
Subject: Information and Communication Engineering
Abstract/Summary:
Analyzing data sets at the terabyte scale is a challenging task for data scientists who aim to quickly reveal valuable insights from big data. To cope with ever-increasing data volumes, approximate computing has emerged as an efficient and cost-effective paradigm for big data analysis. However, applying sampling-based approximation faces challenges in cluster computing frameworks that implement a shared-nothing architecture, e.g., Apache Hadoop and Apache Spark. Since the scalability of these frameworks is limited by the available resources, online sampling operations that scan the entire data set become prohibitive. On the other hand, the quality of block-level samples depends on the data partitioning scheme in the Hadoop Distributed File System (HDFS).

In this dissertation, we combine cluster computing and approximate computing in a new sampling-based approximation approach, called the RSP approach. This approach enables data scientists to explore and analyze big data on computing clusters when the data volume goes beyond the available resources. We propose the Random Sample Partition (RSP) distributed data model to facilitate online random sampling from big data in distributed file systems. In this model, a big data set is represented as a set of ready-to-use, disjoint random sample data blocks, called RSP blocks. A Two-Stage Data Partitioning (TSDP) method is developed to generate an RSP from an HDFS file. The RSP model significantly decreases the online sampling time and improves the quality of block-level samples. RSP blocks can be used directly to estimate the statistical properties of the entire data set. We present a theoretical analysis of the RSP model and prove that estimates from RSP blocks are unbiased and consistent. We show that block-level sampling from an RSP is more efficient than record-level sampling and more effective than block-level sampling from a normal HDFS file.

Given an RSP on a computing cluster, we follow a new strategy for approximate big data analysis using only a few RSP blocks. A block-level sample of RSP blocks is randomly selected and processed in parallel with existing sequential algorithms. The results from these RSP blocks are combined gradually to obtain approximate results that are asymptotically equivalent to the true ones. We apply the RSP approach to scale iterative algorithms to big data and to build ensemble models from block-level samples of RSP blocks. To this end, we propose an asymptotic ensemble learning framework, called the Alpha framework. This framework enables big data computing with RSP blocks on small computing clusters using mainstream cluster computing frameworks, distributed file systems, and sequential data analysis and mining algorithms. We develop a prototype of the Alpha framework using R packages with HDFS. To help data scientists quickly explore big data before applying advanced algorithms, we also propose the RSP-Explore method for big data exploration and cleaning with RSP blocks. In this method, a block-level sample of RSP blocks is used to estimate the statistical properties of both clean and dirty data.

We conducted experiments on a small cluster of 5 nodes using real and synthetic data sets. The experimental results show that a few RSP blocks of a big data set are sufficient to obtain approximate results that are equivalent to those computed from the entire data set (e.g., in classification, regression, and summary statistics). Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the entire unknown clean data. The RSP approach can break the limit of in-memory cluster computing and enable big data exploration and analysis where the entire data set cannot be processed as a whole.
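The idea of estimating a statistic from a few disjoint random sample blocks can be illustrated with a minimal in-memory sketch. This is not the TSDP method or the Alpha framework from the dissertation: it assumes the whole data set fits in memory, builds the blocks with a single global shuffle, and uses hypothetical function names (`make_rsp_blocks`, `estimate_mean`) chosen only for illustration.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Cut a shuffled copy of `records` into disjoint blocks.

    Assumption: a global in-memory shuffle stands in for a real
    partitioning method; each block is then a random sample of the data.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding assigns every record to exactly one block.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def estimate_mean(blocks, num_sampled, seed=0):
    """Estimate the overall mean from a block-level sample of blocks.

    Averaging block means is valid here because all blocks have equal size.
    """
    rng = random.Random(seed)
    chosen = rng.sample(blocks, num_sampled)
    return statistics.fmean(statistics.fmean(block) for block in chosen)


# Synthetic data: 100,000 values drawn from a normal distribution.
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
approx = estimate_mean(blocks, num_sampled=5)  # uses 5% of the data
exact = statistics.fmean(data)                 # full scan, for comparison
```

In this sketch only 5 of the 100 blocks are read to produce `approx`, while `exact` requires a full pass over the data; comparing the two gives a rough feel for the trade-off the abstract describes.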
Keywords/Search Tags: Big Data Analysis, Approximate Computing, Cluster Computing, Random Sampling, Data Partitioning