Research On Techniques And Systems For Big Data Processing

Posted on:2017-04-19

Degree:Doctor

Type:Dissertation

Country:China

Candidate:R Gu

GTID:1318330512954091

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development and wide application of information technology, the amount of valuable data is increasing explosively, ushering in the Big Data era. Big data has attracted intense attention from industry, academia, and the governments of many countries, and has become a strategic asset for countries and enterprises. It brings both significant opportunities and challenges. On the one hand, large-scale data resources contain enormous value for business and society. Effectively managing and using these data resources and mining their deeper value will benefit state governance, social management, enterprise decision-making, and even the daily life of individuals. On the other hand, big data also brings many technical challenges for processing. Its varied formats, complicated forms, and large volumes make it difficult for traditional computing technology to process. As a result, new techniques and methods are needed at several technical layers of computing to provide effective means for big data processing.

Effective big data processing faces major technical obstacles at the data storage, computation, and analytics layers. First, big data at the scale of hundreds of terabytes or even petabytes is far beyond the capacity of traditional database systems. Thus, we need to research and develop effective techniques and systems for distributed data storage and management. Second, large-scale data processing is very time-consuming, so traditional single-machine computing cannot meet the performance demands of big data processing. Thus, we need to research and develop efficient parallel computing techniques and systems for big data processing. Third, big data analytics involves large-scale data mining, for which almost all traditional single-node machine learning and data mining algorithms cannot complete the computation in acceptable time, rendering them unusable. Thus, we need to research and develop effective parallel machine learning and data mining algorithms for big data, together with the related systems.

Big data processing has a typical feature that distinguishes it from traditional computing and information processing: it is a comprehensive technology associated with many aspects of computing and information processing, and its techniques are markedly comprehensive and interrelated. A single or isolated technique cannot process big data effectively. Therefore, effective big data processing needs to tightly combine and integrate the data storage, computing, and analytics layers, forming an integral technology stack and a unified system and platform for big data processing.

Based on the above research issues and background, this dissertation conducts a series of research work on major aspects of big data processing, including distributed data storage techniques and systems, parallel computing techniques and systems, and parallel big data machine learning algorithms and systems. More specifically, the dissertation includes the following major research work and contributions:

(1) Research on distributed data storage and management techniques and systems, including three aspects of work:
1) Research and implement a cache eviction scheduling framework, along with several eviction policies, for hierarchical big data storage systems, which can remarkably improve data access performance for distributed data storage systems.
2) Research and implement a unified benchmarking method and system tool for testing the performance of various distributed file systems, which can be used to improve distributed file system design, to select the optimal file system, and to optimize file system configurations for application development.
3) Research and implement a distributed and hierarchical semantic data storage and management system to efficiently manage large-scale RDF semantic data.

(2) Research on performance optimizations for mainstream parallel big data computing systems, including two aspects of work:
1) Performance optimization for short Hadoop MapReduce job execution. Optimized MapReduce job and task scheduling methods are proposed and implemented, with an instant communication mechanism for task scheduling and status reporting. A new compatible version of Hadoop has been implemented.
2) An off-heap storage mechanism for Spark RDD data is proposed and implemented. A distributed off-heap memory storage system is adopted to persist RDD data, avoiding the performance decline caused by frequent JVM garbage collection (GC).

(3) Research on parallel computing methods and algorithms for big data machine learning and data analytics, including several complex big data machine learning and data analytics algorithms:
1) Research and implement a customized parallel computing platform for large-scale neural network training with the backpropagation algorithm, which can effectively overcome the low performance of large-scale neural network training.
2) Research and implement a parallel Gradient Boosting Regression Tree (GBRT) training algorithm based on K-Means histogram approximation for big data information retrieval applications.
3) Research and implement efficient parallel semantic reasoning algorithms and engines for the widely used RDFS and OWL Horst rule sets.

(4) Research on a unified programming model and system for big data machine learning and data analytics. Targeting the programmability and ease-of-use issues of big data platforms and the low computational performance of big data machine learning and data analytics, a matrix-based unified programming model and framework are proposed. Furthermore, a cross-platform unified prototype system, Octopus, is implemented for big data machine learning and data analytics. Octopus can transparently integrate mainstream big data systems such as Hadoop, Spark, MPI, and Flink, and allows data analytics programmers to use the R/Python programming languages and development environments to conveniently design algorithms and write code for big data machine learning and data analytics.

Through comprehensive research on the distributed data storage, parallel computing, and data analytics layers, a series of outcomes has been achieved, which can serve as effective supporting techniques and systems for constructing a unified big data processing platform. Several techniques and systems from this dissertation have been successfully adopted in, and contributed to, open-source or commercial versions of big data systems and applications in industry.

Keywords/Search Tags:

big data processing, big data storage and management, parallel computing, performance optimization, parallel machine learning algorithms, big data programming model, big data machine learning system

Related items

1	Research On Data Parallel Communication Strategy For Distributed Machine Learning System
2	Research And Parallel Application Of Supervised Learning Algorithms For Large-scale Data Classification Problems
3	Research And Optimization Of Parallel Extreme Learning Machine Algorithm For Big Data
4	Research On Data Mining Algorithm In The Electric Power Cloud Data Analysis Platform
5	Research On Explicit Topology Optimization Considering Parallel Computing And Data Driven
6	Research On Key Methods Of Parallel Machine Learning Model Training
7	Performance optimization of memory-bound programs on data parallel accelerators
8	Accelerated Prototype Of Distributed Big Data Based On Divisible Load Scheduling
9	Research On Scalable Parallel Algorithms For Big Data Processing
10	High Dimensional Multispectral Data Classification By Machine Learning