Research And Implementation Of Marine Information OLAP And Data Mining System Based On Hadoop

Posted on: 2015-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: J H Qiu
Full Text: PDF
GTID: 2308330482457279
Subject: Computer technology
Abstract/Summary:
The 21st century is the century of the ocean. Coastal countries around the world treat the protection of national marine rights, the development of the marine economy, and the protection of the marine ecological environment as important development strategies. The National Oceanic Information Center proposed the "Digital Ocean" development strategy in 1999. Ocean OLAP and ocean data mining are integral parts of "Digital Ocean". Knowledge discovered with OLAP and data mining technologies is useful for marine environment protection, marine meteorological observation, marine meteorological prediction, and marine disaster prevention and mitigation. With the development of information technology, marine data has become increasingly easy to collect. Our country has accumulated marine data on a large scale, and one of the challenges is to analyze this data efficiently. This thesis designs and implements a marine information OLAP and data mining system based on Hadoop, following the requirements of the National Oceanic Information Center's project "research on marine cloud computing and cloud service". It also proposes methods to optimize the OS-ELM algorithm used in the system.

The system adopts a B/S (browser/server) architecture and provides two functions: OLAP and a data mining tool. In the OLAP sub-system, Hive, a distributed data warehouse, is used for the underlying storage. A HiveDialect class is added to the open-source OLAP engine Mondrian so that it can translate MDX queries into HiveQL statements. These HiveQL statements can be executed in parallel in the cloud system, which improves the efficiency of OLAP analysis. In the data mining sub-system, a web user interface is added to Mahout so that users can run data mining jobs more conveniently. POS-ELM, a more efficient and faster parallel classification algorithm, is added as a supplement to the sub-system.
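To make the role of HiveQL in the OLAP sub-system more concrete, the following is a minimal Python sketch of how an aggregated cube view might be expressed as a single HiveQL `GROUP BY` statement. It is not Mondrian's translation logic and not the thesis's HiveDialect class. The function names and the table and column names (`sea_temperature`, `sea_area`, `year`, `temp_c`) are hypothetical, the measure is always aggregated with `AVG`, and filters are restricted to integer equality to keep the sketch short.

```python
def quote_identifier(name):
    """Quote a table or column name with backticks, the HiveQL quoting style."""
    if "`" in name:
        raise ValueError("backtick not allowed in identifier: %r" % name)
    return "`%s`" % name


def aggregate_query(fact_table, levels, measure, slice_on=None):
    """Build one GROUP BY statement for an aggregated view (hypothetical helper).

    fact_table -- name of the fact table
    levels     -- dimension columns to group by (fewer levels = coarser view)
    measure    -- numeric column, aggregated here with AVG
    slice_on   -- optional {column: int} equality filters
    """
    cols = [quote_identifier(c) for c in levels]
    alias = quote_identifier("avg_" + measure)
    select = cols + ["AVG(%s) AS %s" % (quote_identifier(measure), alias)]
    sql = "SELECT %s FROM %s" % (", ".join(select), quote_identifier(fact_table))
    if slice_on:
        # Sort the filter columns so the generated text is deterministic.
        conds = ["%s = %d" % (quote_identifier(c), v)
                 for c, v in sorted(slice_on.items())]
        sql += " WHERE " + " AND ".join(conds)
    if cols:
        sql += " GROUP BY " + ", ".join(cols)
    return sql


# Assumed example schema: average temperature per sea area for one year.
print(aggregate_query("sea_temperature", ["sea_area"], "temp_c",
                      slice_on={"year": 2010}))
```

Because each such statement is an ordinary aggregate query, Hive can compile it into jobs that run in parallel on the cluster, which is the efficiency gain the abstract refers to.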
In the OLAP sub-system, users can execute OLAP operations such as multi-dimensional query, drill-down, roll-up, slice, dice and pivot on data stored on the cloud platform. In the data mining sub-system, users can conveniently select suitable parallel data mining methods to discover information in marine data.

The OS-ELM classification algorithm in the data mining sub-system usually has low accuracy when dealing with high-dimensional, noisy data. This thesis proposes a random subspace ensemble classification algorithm for OS-ELM (RSEOS-ELM). After analyzing the dependencies between the matrix computations in the training phase of RSEOS-ELM, this thesis identifies the matrices that can be computed in parallel and proposes a parallel ensemble classification algorithm for RSEOS-ELM (PRSEOS-ELM) using the MapReduce programming framework. Compared with RSEOS-ELM, PRSEOS-ELM reaches almost the same level of accuracy while training much faster. PRSEOS-ELM, being based on MapReduce, also scales well to large data sets. For example, when constructing the same number of ensemble members, the training time of PRSEOS-ELM is two orders of magnitude lower than that of RSEOS-ELM on a data set with 640,000 training samples and 40,960 dimensions, or with 409,600,000 training samples and 64 dimensions. The speedup of PRSEOS-ELM rises as the number of cores increases, reaching as high as 40 on a cluster with a maximum of 80 cores.
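The random subspace ensemble idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the thesis's RSEOS-ELM or PRSEOS-ELM: the base learner is a simple nearest-centroid classifier standing in for OS-ELM, members are combined by majority vote (the abstract does not say how votes are combined), and all function names are hypothetical. The thesis parallelizes independent matrix computations inside training; the sketch only shows the coarser fact that members trained on different feature subsets do not depend on one another.

```python
import random
from collections import Counter


class NearestCentroid:
    """Stand-in base learner (NOT OS-ELM): predict the class with the closest mean."""

    def fit(self, rows, labels):
        sums, counts = {}, Counter(labels)
        for row, label in zip(rows, labels):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
        self.centroids = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, row):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def train_member(task):
    """Train one member on its own feature subset; tasks are mutually independent."""
    rows, labels, features = task
    projected = [[row[i] for i in features] for row in rows]
    return features, NearestCentroid().fit(projected, labels)


def train_ensemble(rows, labels, n_members, subspace_size, seed=0):
    """Draw a random feature subset per member, then train all members."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    tasks = [
        (rows, labels, sorted(rng.sample(range(n_features), subspace_size)))
        for _ in range(n_members)
    ]
    # Map-style step: each task could be handed to a different worker.
    return list(map(train_member, tasks))


def predict(ensemble, row):
    """Reduce-style step: majority vote over the members' predictions."""
    votes = Counter(
        model.predict([row[i] for i in features]) for features, model in ensemble
    )
    return votes.most_common(1)[0][0]
```

Each member sees only `subspace_size` of the input dimensions, which is the mechanism by which random subspace methods aim to reduce the effect of noisy, high-dimensional features.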
Keywords/Search Tags: Hadoop, Online Analytical Processing, Data Mining, Extreme Learning Machine, Ensemble Classification