
A Distributed Data Management System For Data Analysis

Posted on: 2020-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Wu
Full Text: PDF
GTID: 2428330599477509
Subject: Computer technology
Abstract/Summary:
With the development of Industry 4.0 and Internet+, big data has become commonplace in industry and increasingly affects many aspects of public life. However, as data volumes and model complexity grow, current systems are becoming inadequate and inefficient. Studying how to analyze and model massive data quickly and at low cost therefore has practical value. It is still widely believed that analyzing the full data set, which is expensive, is the only way to analyze the data thoroughly. The overheads of existing distributed systems have sharpened the tension between the urgency and the difficulty of big data analysis. Research on the Random Sample Partition (RSP) shows that analyzing a subset whose distribution is similar to that of the full data can approximate the result of analyzing the full data. This thesis presents a distributed data management system for data analysis based on random sample partition. It develops a new big data management method, based on random sample partition, that provides fast and flexible support for data analysis and modeling. The thesis makes the following main contributions:

1. A distributed random sample partitioning algorithm is proposed. Building on the random sample partition data representation model, the algorithm generates a partition number for each record such that the partition numbers are uniformly distributed overall, and each record is placed in the partition matching its number. The algorithm is a typical MapReduce process, can also be applied on the Spark platform, and scales well. Verification experiments designed in this thesis demonstrate the feasibility and effectiveness of the algorithm. With random sample partitioning, a model built on 1% of the data, at one tenth of the time cost, can approximate the model built on the full data.

2. A random sample partition storage model is proposed. Based on the requirements of data analysis and modeling and on the characteristics of the random sample partition data representation model, this thesis designs a storage model that provides fast random access to partitions and their related metadata.

3. A prototype data management system based on random sample partitioning is developed. Using random sample partitioning and its storage model, a statistically sensible data management system (RSPDMS) is built to meet the needs of rapid big data analysis and modeling. Its horizontal scalability, system architecture, and interface to the enterprise's big data platform have reached the expected goals.
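The partitioning step described in item 1 can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: the function names `assign_partition` and `random_sample_partition`, the fixed seed, and the use of Python's standard `random` module are all assumptions made for illustration. The "map" phase tags each record with a uniformly drawn partition number, and the "reduce" phase groups records by that number.

```python
import random
from collections import defaultdict


def assign_partition(record, num_partitions, rng):
    """Map step (hypothetical helper): tag a record with a partition
    number drawn uniformly at random from 0..num_partitions-1."""
    return rng.randrange(num_partitions), record


def random_sample_partition(records, num_partitions, seed=0):
    """Group records by their randomly assigned partition number.

    Every record lands in exactly one partition, and because the
    partition numbers are uniform, each partition behaves like a
    random sample of the full data set.
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    partitions = defaultdict(list)
    for record in records:
        pid, rec = assign_partition(record, num_partitions, rng)
        partitions[pid].append(rec)  # reduce step: collect by partition number
    return [partitions[i] for i in range(num_partitions)]


if __name__ == "__main__":
    data = list(range(10_000))
    blocks = random_sample_partition(data, num_partitions=10)
    full_mean = sum(data) / len(data)
    for i, block in enumerate(blocks):
        # Each block's mean should be close to the full-data mean.
        print(i, len(block), round(sum(block) / len(block), 1), full_mean)
```

In a MapReduce or Spark setting the same idea would presumably use the random partition number as the shuffle key, so that the framework routes each record to its partition. How the thesis handles details such as exact partition sizes is not stated in the abstract.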
Keywords/Search Tags:Big data, Distributed data management system, Random sample partition, Random sample partition storage model