The Design And Implementation Of A Set Of Mathematical Statistics Functions Based On Hadoop

Posted on:2014-02-28

Degree:Master

Type:Thesis

Country:China

Candidate:C Peng

Full Text:PDF

GTID:2248330398470895

Subject:Computer Science and Technology

Abstract/Summary:

Many modern enterprises collect data at the most detailed level possible, creating data repositories that range from terabytes to petabytes in size. The ability to apply sophisticated analysis to this data is becoming essential for competitiveness in the marketplace. Mathematical statistical analysis is a classical data analysis method that applies the theory of mathematical statistics to data. With it, users can directly see the central tendency, dispersion, and distribution of the data, and can also make inferences or extrapolations from a sample to the population based on analysis of the sample.

Stand-alone mathematical statistics algorithms can handle only a limited amount of data because of memory limitations. To scale to larger data, this paper introduces a set of parallel mathematical statistics functions which, as part of the "big cloud-parallel data mining tool" (BC-PDM), provides users with statistical analysis services on a cloud platform in the SaaS (software-as-a-service) model. The main work of the paper is as follows.

First, the author surveyed popular mathematical statistics software such as SAS and IBM SPSS to determine which functions the set should include. Based on this survey, the set is divided into two subsets: a descriptive statistics subset and an inferential statistics subset. The descriptive statistics subset contains a function that analyzes the quantitative characteristics of the data. The inferential statistics subset consists of one-way ANOVA, simple (unary) linear regression, the test of the mean of a single normal population, the test of the difference between the means of two normal populations, the paired-data test, univariate analysis, and multivariate analysis.

Second, the author studied the principle of each mathematical statistics function and then designed and implemented stand-alone algorithms. Next, the author designed and implemented parallel algorithms based on the stand-alone algorithms using Hadoop MapReduce.

Finally, the author ran extensive experiments to test the functionality and performance of the parallel algorithms. The experimental results show that all the algorithms are correct. On small-scale data the parallel algorithms have no advantage over the stand-alone algorithms, but as the data scale grows their performance advantage becomes increasingly pronounced.
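The abstract does not say how the descriptive statistics function is decomposed for MapReduce. As an illustration only, the sketch below shows one common way a mean and variance can be computed in a map-then-reduce style: each data split is summarized independently, and the partial summaries are merged with the pairwise formula of Chan et al. It is plain Python that simulates the pattern in memory, not Hadoop code and not the thesis's implementation; the names `Moments`, `map_split`, `merge`, and `describe` are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce
import math


@dataclass(frozen=True)
class Moments:
    """Mergeable summary of one data split: count, mean, and M2
    (the sum of squared deviations from the mean)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def map_split(values):
    """'Map' step (assumed): summarize one split in a single pass
    using Welford's update, so the split never has to be sorted or stored."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Moments(n, mean, m2)


def merge(a, b):
    """'Reduce' step (assumed): combine two partial summaries.
    The operation is associative, so summaries can be merged in any grouping."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def describe(splits):
    """Simulate the whole job: map every split, then fold the summaries."""
    total = reduce(merge, (map_split(s) for s in splits), Moments())
    # Sample variance (n - 1 denominator); undefined for fewer than 2 values.
    variance = total.m2 / (total.n - 1) if total.n > 1 else float("nan")
    return {
        "count": total.n,
        "mean": total.mean,
        "variance": variance,
        "std": math.sqrt(variance),
    }


if __name__ == "__main__":
    # Three "splits" of the same data set, as a distributed file might provide.
    print(describe([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]))
```

Because each summary is only three numbers, the data exchanged between the map and reduce sides stays constant in size no matter how large a split is. This is one reason this kind of statistic is a natural fit for the approach the abstract describes.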

Keywords/Search Tags:

mathematical statistics, data analysis, Hadoop, parallel algorithm, SaaS

Related items

1	The Design And Implementation Of Log Statistics Analysis System Based On Hadoop
2	The Hadoop-based Statistics Of Mass Data On Huge Website And Its Application
3	Design And Implementation Of The Weibo Statistical System Based On Hadoop
4	Parallel Association Rules Algorithm Based On Hadoop
5	Research And Implementation Of Real-time Banking Statistics Report Based On Hadoop
6	Statistics And Analysis About User’s Data Of Surfing The Internet Based On J2EE
7	The Research And Implementation Of Brand Image Monitoring System On Hadoop
8	Design And Implementation Of The SaaS Customer Data Survey And Analysis System
9	Research And Realization Of Application Software Of Data Statistics And Analysis
10	Study And Application On Optimizing Data Statistics And Analysis Of Manufacturing With Aptitude Statistics Platform