The Design And Implementation Of A Set Of Mathematical Statistics Functions Based On Hadoop

Posted on: 2014-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: C Peng
GTID: 2248330398470895
Subject: Computer Science and Technology
Abstract/Summary:
Many modern enterprises collect data at the most detailed level possible, creating data repositories that range from terabytes to petabytes in size. The ability to apply sophisticated analysis to this data is becoming essential for marketplace competitiveness. Mathematical statistical analysis is a classical data analysis method that applies the theory of mathematical statistics to data. With it, users can directly see the central tendency, dispersion, and distribution of the data, and can also make inferences or extrapolations from a sample to the population based on analysis of the sample.
Stand-alone mathematical statistics algorithms can handle only a limited amount of data because of memory limitations. To handle larger data sets, this thesis introduces a set of parallel mathematical statistics functions. As part of the "big cloud-parallel data mining tool" (BC-PDM), the set provides users with a statistical analysis service on a cloud platform, delivered as SaaS (software-as-a-service).
The main work of the thesis is as follows:
First, the author surveyed popular mathematical statistics software, such as SAS and IBM SPSS, to determine which functions the set should include. Based on this survey, the set is divided into two subsets: a descriptive statistics subset and an inferential statistics subset. The descriptive statistics subset contains a function that analyzes the quantitative characteristics of the data. The inferential statistics subset consists of one-way ANOVA, simple (one-variable) linear regression, a test of the mean of a single normal population, a test of the difference between the means of two normal populations, a test for paired data, univariate analysis, and multivariate analysis.
Second, the author studied the principle behind each mathematical statistics function and then designed and implemented the corresponding stand-alone algorithms.
Next, the author designed and implemented parallel algorithms based on the stand-alone algorithms, using Hadoop MapReduce.
Finally, the author ran experiments to test the functionality and performance of the parallel algorithms. The experiments show that all the algorithms produce correct results. On small-scale data the parallel algorithms have no advantage over the stand-alone algorithms, but as the data scale grows, their performance advantage becomes increasingly pronounced.
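The abstract does not describe how the parallel functions are built. As a rough illustration of the general idea of moving a descriptive statistic from a stand-alone algorithm to a map/reduce form, the following Python sketch summarizes each data split independently and then merges the partial summaries. It is an assumption-laden sketch, not the thesis's implementation: it runs in a single process rather than on Hadoop, the names `Partial`, `map_split`, `merge`, and `describe` are hypothetical, and the pairwise merge formula (due to Chan et al.) is a standard technique chosen here for illustration.

```python
from dataclasses import dataclass
from functools import reduce
from math import sqrt


@dataclass(frozen=True)
class Partial:
    """Mergeable summary of one data split (hypothetical helper type).

    n    -- number of values seen
    mean -- running mean of those values
    m2   -- sum of squared deviations from the mean
    """
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def map_split(values):
    """'Map' step: summarize one split in a single pass (Welford's update)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Partial(n, mean, m2)


def merge(a, b):
    """'Reduce' step: combine two partial summaries into one."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Partial(n, mean, m2)


def describe(splits):
    """Compute count, mean, sample variance, and standard deviation over all splits."""
    total = reduce(merge, (map_split(s) for s in splits), Partial())
    variance = total.m2 / (total.n - 1) if total.n > 1 else float("nan")
    return {
        "count": total.n,
        "mean": total.mean,
        "variance": variance,
        "std": sqrt(variance),
    }


if __name__ == "__main__":
    # Three splits standing in for three input blocks processed by separate mappers.
    print(describe([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]))
```

Because each split is reduced to a fixed-size summary, no single process has to hold the full data set in memory, which is the limitation of stand-alone algorithms that the abstract points to. Since `merge` is associative, the partial summaries can be combined in any grouping, so the same logic could in principle serve as both combiner and reducer in a MapReduce job.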
Keywords/Search Tags: mathematical statistics, data analysis, Hadoop, parallel algorithm, SaaS