
Research On Parallel Shared Decision Tree Algorithm Based On Hadoop

Posted on: 2014-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhang
Full Text: PDF
GTID: 2268330425484245
Subject: Computer technology
Abstract/Summary:
Shared knowledge mining, which addresses knowledge discovery problems shared by two or more applications or datasets, can help users investigate a poorly understood dataset (PD) through research by analogy, transferring knowledge from a well understood dataset (WD). Shared decision trees, an important type of shared knowledge, (a) are highly accurate in both WD and PD and (b) behave in a highly similar way in both. However, the existing algorithms are serial: they cannot keep up with the rapid growth of data and are inefficient on big datasets. This work therefore focuses on parallelizing the shared knowledge mining algorithm. The main work and contributions are as follows:

(1) To address the low efficiency of the serial shared knowledge algorithm on big datasets, a parallel shared decision tree algorithm (PSDT) based on Hadoop is proposed, drawing on cloud computing technology and the parallelization ideas of decision tree algorithms. PSDT uses the MapReduce model and an attribute-list structure to achieve attribute parallelism and node parallelism, and it pre-sorts the attribute lists in parallel using the sorting mechanism of the MapReduce model. Experimental results show that, compared with the serial shared decision tree algorithm (SDT), PSDT can handle larger datasets and scales well; on large-scale datasets, PSDT is significantly more efficient than SDT.

(2) To address the performance bottleneck of the Hadoop cluster, a novel hybrid data structure is introduced that reduces I/O operations by following the strategy of "CPU for I/O". The parallel shared decision tree algorithm based on the hybrid data structure (HPSDT) uses attribute lists to compute the split indicators in parallel and a data-record structure to perform the splitting. Compared with the traditional attribute-list structure, the hybrid data structure reduces data redundancy, simplifies the splitting process, and greatly reduces I/O operations: the I/O operations of HPSDT are 0.34 times those of PSDT. The experimental results show that HPSDT has good parallelism and scalability.

(3) The time performance of HPSDT is compared with that of PSDT. The experimental results show that HPSDT outperforms PSDT: the ratio of PSDT runtime to HPSDT runtime reaches 2.45 when the dataset size is 917M, and the advantage becomes more pronounced as the dataset size increases.
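The attribute-list idea mentioned in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the PSDT or HPSDT implementation: the function names, the use of the Gini index as the split indicator, and the midpoint threshold are all assumptions made for illustration, and the shared-tree aspect (evaluating a split on both WD and PD) is left out. Each attribute list is sorted once and then scanned independently, which is the property that lets the per-attribute work be distributed, for example one attribute per map task.

```python
from collections import Counter


def build_attribute_lists(records, labels):
    """Build one list per attribute holding (value, label, record_id),
    sorted by value. In a MapReduce setting the sort would be done by
    the shuffle phase; here it is a plain in-memory sort."""
    n_attrs = len(records[0])
    lists = []
    for a in range(n_attrs):
        entries = [(rec[a], labels[i], i) for i, rec in enumerate(records)]
        entries.sort(key=lambda e: e[0])
        lists.append(entries)
    return lists


def gini(counts, total):
    """Gini impurity of a class-count table (assumed split indicator)."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split(attr_list):
    """Scan one pre-sorted attribute list once and return
    (weighted_gini, threshold) for its best binary split."""
    total = len(attr_list)
    right = Counter(label for _, label, _ in attr_list)
    left = Counter()
    best = (float("inf"), None)
    for k in range(total - 1):
        value, label, _ = attr_list[k]
        left[label] += 1
        right[label] -= 1
        next_value = attr_list[k + 1][0]
        if value == next_value:
            continue  # cannot split between equal values
        n_left = k + 1
        n_right = total - n_left
        score = (n_left * gini(left, n_left)
                 + n_right * gini(right, n_right)) / total
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best


def best_attribute(records, labels):
    """Evaluate every attribute list independently and pick the best.
    Returns (attribute_index, threshold, weighted_gini)."""
    lists = build_attribute_lists(records, labels)
    scores = [best_split(al) for al in lists]  # independent per attribute
    idx = min(range(len(scores)), key=lambda a: scores[a][0])
    return idx, scores[idx][1], scores[idx][0]


# Hypothetical toy data: two numeric attributes, two classes.
records = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
result = best_attribute(records, labels)
```

On this toy data the first attribute separates the two classes cleanly, so it is selected. The record identifiers stored in each entry are what an attribute-list algorithm would use to route records to child nodes after a split; the abstract states that HPSDT replaces that step with a data-record structure to reduce I/O.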
Keywords/Search Tags: Shared Knowledge, Parallel Shared Decision Tree, Hybrid Data Structure, Cloud Computing, Hadoop