A Distributed Data Management System For Data Analysis

Posted on:2020-07-06

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Wu

Full Text:PDF

GTID:2428330599477509

Subject:Computer technology

Abstract/Summary:

With the development of Industry 4.0 and Internet+, big data has become commonplace in industry and increasingly affects every aspect of public life. However, as data volumes and model complexity grow, existing systems are becoming inadequate and inefficient. Studying how to analyze and model massive data quickly and at low cost is therefore of practical value. It is still widely believed that analyzing the full data set, which is costly, is the only way to analyze the data thoroughly. The various overheads of existing distributed systems have sharpened the tension between the urgency and the difficulty of big data analysis. Research on the Random Sample Partition (RSP), however, shows that analyzing a subset of the data whose distribution is similar to that of the full data set can approximate the results obtained from the full data set. This thesis presents a distributed data management system for data analysis based on random sample partition. A new method of big data management, based on random sample partition, is developed to provide fast and flexible support for data analysis and modeling. The thesis covers the following main aspects:

1. A distributed random sample partitioning algorithm is proposed. Building on the random sample partition data representation model, the algorithm generates a partition number for each record such that the partition numbers as a whole are uniformly distributed. Each record is then placed in the partition that corresponds to its number. The algorithm is a typical MapReduce process, can also be applied to the Spark platform, and has good scalability. The verification experiments designed in this thesis demonstrate its feasibility and effectiveness. With random sample partitioning, a model built on 1% of the data, at only one tenth of the time cost, can approximate the model built on the full data set.

2. A random sample partition storage model is proposed. Based on the requirements of data analysis and modeling and on the characteristics of the random sample partition data representation model, this thesis designs a random sample partition storage model. Its goal, which was reached, is to provide fast random access to partitions and their related metadata.

3. A prototype data management system based on random sample partitioning is developed. Based on random sample partitioning and its storage model, a statistically sensible data management system (RSPDMS) is built to meet the needs of rapid big data analysis and modeling. Its horizontal scalability, system architecture, and interface to the enterprise's big data platform have met the expected goals.
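The partitioning step described in the abstract (a uniformly distributed partition number per record, followed by grouping on that number) can be illustrated with a small single-process sketch. This is not the thesis's implementation: the function names `map_assign`, `reduce_collect`, and `random_sample_partition` are hypothetical, the map and reduce phases are simulated in plain Python rather than run on MapReduce or Spark, and drawing each partition number independently is only one possible way to obtain a uniform distribution (partition sizes come out approximately, not exactly, equal).

```python
import random
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def map_assign(
    records: Iterable[T], num_partitions: int, rng: random.Random
) -> Iterator[Tuple[int, T]]:
    """Map phase (simulated): tag each record with a partition number
    drawn uniformly at random from 0 .. num_partitions - 1."""
    for record in records:
        yield rng.randrange(num_partitions), record


def reduce_collect(
    pairs: Iterable[Tuple[int, T]], num_partitions: int
) -> List[List[T]]:
    """Reduce phase (simulated): gather the records that share a
    partition number into the same block."""
    partitions: List[List[T]] = [[] for _ in range(num_partitions)]
    for partition_id, record in pairs:
        partitions[partition_id].append(record)
    return partitions


def random_sample_partition(
    records: Iterable[T], num_partitions: int, seed: int = 0
) -> List[List[T]]:
    """Split records into blocks; each block is a random sample of the
    input because partition numbers are assigned independently of the
    record contents."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return reduce_collect(map_assign(records, num_partitions, rng), num_partitions)
```

Calling `random_sample_partition(range(10000), 100)` yields 100 blocks of roughly 100 records each. Because the assignment ignores record contents, any single block can serve as a random sample of the whole data set, which is the property the abstract relies on when it builds a model from 1% of the data.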

Keywords/Search Tags:

Big data, Distributed data management system, Random sample partition, Random sample partition storage model

Related items

1	Research On Regression Acceleration Algorithm Based On Partition And Sampling
2	The Study Of Complex Data Processing Method Based On Classification
3	The Research Of Accelerated Learning Algorithms Based On Partition And Condensation
4	Research And Application Of The Partition Technology In Real-Time Data Warehouse
5	Research And Application Of The Partition Technology In Real-time Data Warehouse
6	Research On The Improvement Of Adaptive Random Testing Based On Restriction And Partition Strategies
7	Neural Networks For Small Sample Data Classification Integrated With Decentralized Technology
8	Partition clustering of high dimensional low sample size data based on p-values
9	Research On Virtual Partition Strategies Of A Shared Storage Distributed Database
10	Dynamic Data Partition In Distributed Information Networking Database Management System