
Hadoop-based Parallel Algorithm For Mining

Posted on: 2014-09-27    Degree: Master    Type: Thesis
Country: China    Candidate: H Wang    Full Text: PDF
GTID: 2268330398495941    Subject: Computer application technology
Abstract/Summary:
With the rapid development of network technology, the mobile Internet, social networking, and sensor technology, a large number of intelligent terminals that can quickly generate and disseminate data have emerged. These terminals produce, at an alarming rate, huge volumes of data that must be stored for long periods; more than 80% of it is unstructured, it grows continuously, and even non-hotspot data may still be accessed. Such data differs significantly from the data traditionally stored in relational databases, and these differences make traditional storage and management solutions unable to handle the analysis, management, and mining tasks of the "big data" era. Moreover, because the data keeps growing, traditional data mining solutions cannot adapt to analysis tasks over almost unlimited, ever-expanding data sets.

The core issue of the big data era is that data mining must refine useful information from raw data, much as precious metals are smelted from ore. Processing and analyzing such data with cloud computing in order to extract useful information has therefore become an important research direction in the field of data mining. Hadoop, an Apache open source project, provides a distributed file system and the MapReduce computing framework; it forms the infrastructure of a cloud computing software platform and integrates a set of components such as databases and data warehouses. Hadoop has become the standard platform for cloud computing research and applications in both academia and industry.

This thesis focuses on the Hadoop software framework: the core architectures and operating mechanisms of HDFS, MapReduce, HBase, and other components. It then analyzes the shortcomings of the framework, such as the single points of failure in HDFS and MapReduce, gives corresponding solutions, and on this basis builds a highly reliable and secure Hadoop environment. Combining the characteristics of traditional classification and clustering algorithms, it presents the design of a cloud-based data mining system and describes the layered functions of the system in detail, especially the classification and clustering modules. The main work includes the following points:

(1) The single points of failure and the security of the HDFS NameNode and the MapReduce JobTracker are surveyed and studied, the major components of the Hadoop ecosystem are examined in depth, corresponding solutions are given, and a highly reliable and secure Hadoop cluster is built.

(2) The architecture of a Hadoop-based data mining platform is studied by analyzing the features of big data analysis techniques and the characteristics of traditional data mining systems.

(3) On the basis of the work above, Hadoop-based classification algorithms are designed and implemented, mainly the parallel Naive Bayes and SVM classifiers, together with the related experiments; the parallel clustering algorithms (K-Means) are designed and evaluated in the same way.
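To make the classification part concrete: the training step of a parallel Naive Bayes classifier maps naturally onto MapReduce as a distributed counting job. The sketch below is only a minimal illustration of this idea, not the implementation described in the thesis; the input format (a label, a tab, then space-separated features), the class and job names, and the paths are assumptions. The mapper emits one count per class and per (class, feature) pair, and the reducer sums them, which yields the statistics needed for the class priors P(c) and the conditional probabilities P(f|c).

// Minimal sketch (assumed input format, not the thesis's code):
// parallel Naive Bayes training reduced to distributed frequency counting.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NaiveBayesCountJob {

    public static class CountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each line is assumed to be "label<TAB>feat1 feat2 ...".
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;                          // skip malformed lines
            String label = parts[0];
            context.write(new Text("CLASS\t" + label), ONE);       // counts for P(c)
            for (String feature : parts[1].split("\\s+")) {
                context.write(new Text(label + "\t" + feature), ONE); // counts for P(f|c)
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args: <input path> <output path>
        Job job = Job.getInstance(new Configuration(), "naive-bayes-counts");
        job.setJarByClass(NaiveBayesCountJob.class);
        job.setMapperClass(CountMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation of counts
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A small driver or script would then normalize these counts into probabilities; the counting job itself is the part that benefits from running in parallel over HDFS blocks.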
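Similarly, one iteration of parallel K-Means can be expressed as a MapReduce job in which the mapper assigns each point to its nearest centroid and the reducer averages the assigned points into new centroids. The following sketch is again only an assumed illustration, not the thesis's implementation: the configuration key used to pass the centroids, the semicolon/comma-separated centroid encoding, and the comma-separated point format are hypothetical, and a real driver would rerun the job until the centroids stop moving.

// Minimal sketch of one K-Means iteration on MapReduce (assumed data formats).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansIteration {

    public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] centroids;

        @Override
        protected void setup(Context context) {
            // Hypothetical property name; centroids serialized by the driver as "x1,y1;x2,y2;...".
            String[] rows = context.getConfiguration().get("kmeans.centroids").split(";");
            centroids = new double[rows.length][];
            for (int i = 0; i < rows.length; i++) {
                String[] cols = rows[i].split(",");
                centroids[i] = new double[cols.length];
                for (int j = 0; j < cols.length; j++) centroids[i][j] = Double.parseDouble(cols[j]);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each line is assumed to be a comma-separated point, e.g. "1.0,2.5".
            String[] cols = value.toString().split(",");
            double[] point = new double[cols.length];
            for (int j = 0; j < cols.length; j++) point[j] = Double.parseDouble(cols[j]);

            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < centroids.length; i++) {
                double d = 0;
                for (int j = 0; j < point.length; j++) {
                    double diff = point[j] - centroids[i][j];
                    d += diff * diff;                      // squared Euclidean distance
                }
                if (d < bestDist) { bestDist = d; best = i; }
            }
            context.write(new IntWritable(best), value);   // cluster id -> point
        }
    }

    public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double[] sum = null;
            long count = 0;
            for (Text v : values) {
                String[] cols = v.toString().split(",");
                if (sum == null) sum = new double[cols.length];
                for (int j = 0; j < cols.length; j++) sum[j] += Double.parseDouble(cols[j]);
                count++;
            }
            StringBuilder centroid = new StringBuilder();
            for (int j = 0; j < sum.length; j++) {
                if (j > 0) centroid.append(',');
                centroid.append(sum[j] / count);           // mean of assigned points
            }
            context.write(key, new Text(centroid.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        // args: <input points> <output dir> <initial centroids, e.g. "1.0,2.0;5.0,6.0">
        Configuration conf = new Configuration();
        conf.set("kmeans.centroids", args[2]);
        Job job = Job.getInstance(conf, "kmeans-iteration");
        job.setJarByClass(KMeansIteration.class);
        job.setMapperClass(AssignMapper.class);
        job.setReducerClass(RecomputeReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The emitted output is the set of new centroids; a driver loop would feed them back into the next job's configuration until convergence, which is the usual way an iterative algorithm such as K-Means is run on plain MapReduce.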
Keywords/Search Tags: Hadoop, Classification, Clustering, Naive Bayes, K-Means, Parallel Computing, DRBD, Kerberos