Research On Parallel Shared Decision Tree Algorithm Based On Hadoop

Posted on:2014-06-01

Degree:Master

Type:Thesis

Country:China

Candidate:C Zhang

Full Text:PDF

GTID:2268330425484245

Subject:Computer technology

Abstract/Summary:

Shared knowledge mining, which addresses knowledge discovery problems shared by two (or more) applications or datasets, can help users investigate a poorly understood dataset (PD) through research by analogy and by transferring knowledge from a well-understood dataset (WD). As one important type of shared knowledge, shared decision trees (a) are highly accurate in both WD and PD, and (b) behave highly similarly in WD and PD. However, the existing algorithms are serial: they cannot keep up with the rapid growth of data and are inefficient on big datasets. This thesis therefore focuses on parallelizing the shared knowledge mining algorithm. The main work and contributions are as follows:

(1) To address the low efficiency of the serial shared knowledge algorithm on big datasets, a parallel shared decision tree algorithm (PSDT) based on Hadoop is proposed, drawing on cloud computing technology and on the ideas behind parallel decision tree algorithms. PSDT uses the MapReduce model and the attribute-list structure to achieve attribute parallelism and node parallelism, and the attribute lists are pre-sorted in parallel using the sorting mechanism of the MapReduce model. Experimental results show that, compared with the serial shared decision tree algorithm (SDT), PSDT can handle larger data and scales well; on large-scale datasets, PSDT is significantly more efficient than SDT.

(2) To relieve the performance bottleneck of the Hadoop cluster, a novel hybrid data structure is introduced that reduces I/O operations by following a “CPU for I/O” strategy. The parallel shared decision tree algorithm based on the hybrid data structure (HPSDT) uses attribute lists to compute the split indicators in parallel and a data-records structure to carry out the split. Compared with the traditional attribute-list structure, the hybrid data structure reduces data redundancy, simplifies the splitting process, and greatly reduces I/O operations: the I/O operations of HPSDT are 0.34 times those of PSDT. Experimental results show that HPSDT has good parallelism and scalability.

(3) The time performance of HPSDT is compared with that of PSDT. The experimental results show that HPSDT outperforms PSDT: the ratio of PSDT runtime to HPSDT runtime reaches 2.45 when the dataset size is 917M, and the advantage becomes more pronounced as the dataset size grows.
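
The attribute-list idea mentioned in the abstract can be illustrated with a small single-machine sketch. It is not the thesis's implementation: the function names, the sample data, and the use of the Gini index as the split indicator are all assumptions made for illustration, and Python's `sorted` stands in for the sort phase that a MapReduce framework performs between map and reduce. The record-oriented split at the end is only one possible reading of the "data-records structure" the abstract refers to.

```python
from collections import Counter, defaultdict


def map_record(rid, record, label):
    """Map step (hypothetical): emit one (attribute, entry) pair per attribute."""
    for attr, value in record.items():
        yield attr, (value, label, rid)


def build_attribute_lists(records, labels):
    """Emulate map -> sort -> reduce: one value-sorted list per attribute."""
    groups = defaultdict(list)
    for rid, (record, label) in enumerate(zip(records, labels)):
        for attr, entry in map_record(rid, record, label):
            groups[attr].append(entry)
    # sorted() is a stand-in for the framework's sort phase (assumption).
    return {a: sorted(es, key=lambda e: (e[0], e[2])) for a, es in groups.items()}


def gini(counts, total):
    """Gini impurity of a class-count table (assumed split indicator)."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split(attr_list):
    """Scan one pre-sorted attribute list; return (weighted_gini, threshold)."""
    total = len(attr_list)
    right = Counter(label for _, label, _ in attr_list)
    left = Counter()
    best = (float("inf"), None)
    for i in range(total - 1):
        value, label, _ = attr_list[i]
        left[label] += 1
        right[label] -= 1
        next_value = attr_list[i + 1][0]
        if value == next_value:
            continue  # no threshold can separate equal values
        n_left, n_right = i + 1, total - (i + 1)
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / total
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best


def split_records(records, labels, attr, threshold):
    """Record-oriented split: one pass over whole rows, no per-attribute lookup."""
    left, right = [], []
    for record, label in zip(records, labels):
        (left if record[attr] <= threshold else right).append((record, label))
    return left, right


# Made-up sample data, for illustration only.
records = [
    {"age": 25, "income": 40},
    {"age": 35, "income": 60},
    {"age": 45, "income": 80},
    {"age": 55, "income": 30},
]
labels = ["no", "no", "yes", "yes"]

lists = build_attribute_lists(records, labels)
for attr, attr_list in lists.items():
    print(attr, best_split(attr_list))
```

Because each attribute list is already sorted, a single scan per attribute is enough to evaluate every candidate threshold, and the scans for different attributes are independent of one another, which is what makes attribute-level parallelism possible in principle.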

Keywords/Search Tags:

Shared Knowledge, Parallel Shared Decision Tree, Hybrid Data Structure, Cloud Computing, Hadoop

Related items

1	The Research On Decision Tree Algorithm's Parallelization Based On Hadoop Platform
2	The Research Of Decision Tree Mining Based On Hadoop
3	The Parallel Reseach On Decision Tree Classification Algorithm Based On Hadoop
4	Naplus: A Software Shared Memory For Virtual Clusters
5	Research Of Cloud Computing Based On The Shared-nothing Architecture Parallel File System
6	Research On Parallel Decision Tree Algorithm Based On Hadoop Platform
7	Research Of Attribute-based Encryption Schemes Based On Shared Sub-policy In Cloud Computing
8	The Study Of Public Auditing For Shared Data In The Cloud
9	A Parallel Programming Language With Shared Resource Declaration Design And Front-end Implement
10	A Fast Parallel Topic Modeling Algorithm In Shared Memory System