
Design And Implementation Of Uncertain Big Data Analysis Prototype System

Posted on: 2015-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: H C Zhang
Full Text: PDF
GTID: 2308330482954535
Subject: Computer application technology
Abstract/Summary:
In recent years, as the volume of information and data has grown rapidly, the demand for big data analysis has become increasingly urgent. At the same time, as people have come to understand the uncertainty of data more deeply, the demand for uncertain data analysis over big data has begun to rise. However, existing research and systems for Online Analytical Processing (OLAP) on data warehouses are all built on traditional databases, and their support for big data is inadequate. Drawing on an uncertain big data project, this thesis implements a process that converts uncertain data into certain data on the MapReduce computing framework, and designs and implements an OLAP system based on the Hive data warehouse.

The thesis mainly studies the Monte Carlo sampling algorithm, using a sampling process to convert uncertain graph data into certain data. It then constructs a multi-dimensional data model on the distributed data warehouse Hive, populated from the resulting certain data sets, and builds the OLAP system on top of that model. The contributions of this thesis are summarized as follows.

First, it studies the Simple Random Sampling algorithm and the features of uncertain data sets, and implements an effective method called the Unequal Probability Sampling algorithm. The idea of the algorithm is applied to the MapReduce model; the effectiveness of the algorithm is verified and its efficiency is improved.

Second, it studies the model and operations of OLAP. Using the distributed data warehouse Hive on the Hadoop platform, it builds a multi-dimensional data model on Hive according to the analysis requirements and defines the operations of multi-dimensional analysis. It also studies a virtual dimension analysis method, called UDF, based on Hive user-defined functions. Because this method takes the features of the data sets into account, it not only meets the analysis requirements but is also more efficient.

Third, it designs and implements a multi-dimensional data analysis system based on certain data. The system uses a three-tier architecture consisting of an analysis engine layer, a business control layer, and a user action layer. The analysis engine layer is primarily responsible for storing the data warehouse model and for using MapReduce to execute analysis tasks. The business control layer is primarily responsible for controlling the analysis process, including task definition, dimension definition, and analysis execution. The user action layer is primarily responsible for user interaction with the system: it runs analyses and queries and displays the analysis results through a browser/server (B/S) web architecture.
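
The sampling step described in the abstract, turning uncertain graph data into certain data, can be illustrated with a minimal Python sketch. It is not the thesis's Unequal Probability Sampling algorithm, whose details the abstract does not give. It assumes the common model in which each edge of an uncertain graph carries an independent existence probability, and it draws one certain graph (a "possible world") per Monte Carlo sample. The names `sample_possible_world` and `estimate_edge_frequency` are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, Iterable, List, Tuple

# Assumed representation: (source, target, existence probability).
UncertainEdge = Tuple[str, str, float]
CertainEdge = Tuple[str, str]


def sample_possible_world(
    edges: Iterable[UncertainEdge], rng: random.Random
) -> List[CertainEdge]:
    """Draw one certain graph by keeping each edge with its own probability."""
    world: List[CertainEdge] = []
    for src, dst, prob in edges:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for edge {src}->{dst}")
        # rng.random() lies in [0, 1), so prob=1.0 always keeps the edge
        # and prob=0.0 never does.
        if rng.random() < prob:
            world.append((src, dst))
    return world


def estimate_edge_frequency(
    edges: List[UncertainEdge], num_samples: int, seed: int = 0
) -> Dict[CertainEdge, float]:
    """Monte Carlo estimate of how often each edge appears across samples."""
    rng = random.Random(seed)
    counts: Dict[CertainEdge, int] = {}
    for _ in range(num_samples):
        for edge in sample_possible_world(edges, rng):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: count / num_samples for edge, count in counts.items()}
```

In a MapReduce setting, one plausible arrangement (again an assumption, not the thesis's design) is for each map task to run `sample_possible_world` over its own partition of the edge list, since the per-edge draws are independent.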
Keywords/Search Tags: Big Data, Uncertain Data, Sampling Algorithm, Hive Data Warehouse, OLAP