
Fast Decision Tree Induction On High-dimensional Datasets

Posted on: 2018-12-19  Degree: Master  Type: Thesis
Country: China  Candidate: J L Liu  Full Text: PDF
GTID: 2348330512986745  Subject: Computer software and theory
Abstract/Summary:
Data mining is a popular approach to extracting knowledge and information from data, and the decision tree is one of the most popular data mining algorithms. However, existing decision tree algorithms are not efficient enough to handle high-dimensional data due to their heavy computation load, or are even infeasible because of their large RAM (Random Access Memory) requirements. This thesis designs fast decision tree induction algorithms for high-dimensional datasets.

Firstly, we propose IBH (impurity-based heuristic schema) to cut down the computation load of the decision tree induction procedure. IBH reuses computation results from a parent node to approximate upper bounds on the corresponding quantities in its child nodes, so that less computation is needed to find the best splits in the children. We justify the approach mathematically and conclude from extensive experiments that, in high-dimensional cases, IBH reduces around 70% of the computation load for both single decision trees and ensemble methods, with no negative impact on accuracy or concept conciseness.

Secondly, to optimize RAM usage and disk load, we propose a parallel induction algorithm that divides the dataset both horizontally and vertically. Compared with traditional horizontal-division or vertical-division methods, this approach cuts the cluster's RAM usage from O(T) to O(√T), where T is the number of parallel workers. At the same time, each worker's RAM usage drops from O(1) to O(1/√T): the bigger the cluster, the less RAM each worker needs. The approach has no negative impact on network throughput, disk load, or CPU utilization, and achieves good parallel efficiency on clusters of different scales.
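The abstract does not give the details of IBH, so the following is only a minimal sketch of the general idea it describes: per-feature impurity gains measured at the parent node are treated as approximate upper bounds on the gains in a child node, and features that cannot beat the running best are skipped. All names here (`best_split`, `parent_gains`, the Gini criterion) are illustrative assumptions, not the thesis's actual algorithm.

```python
import numpy as np

def gini(y):
    # Gini impurity of a 1-D label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p * p)

def feature_gain(x, y, base):
    # best impurity gain over all threshold splits on one feature
    best = 0.0
    for t in np.unique(x)[:-1]:
        mask = x <= t
        w = mask.mean()
        best = max(best, base - w * gini(y[mask]) - (1 - w) * gini(y[~mask]))
    return best

def best_split(X, y, parent_gains=None):
    """Find the feature with the largest impurity gain.

    If `parent_gains` (per-feature gains measured at the parent node)
    is given, each value is treated as an approximate upper bound on
    that feature's gain in this node -- the assumption of an IBH-style
    heuristic -- and features that cannot beat the running best gain
    are skipped without being evaluated.
    Returns (best_feature, best_gain, n_features_evaluated)."""
    base = gini(y)
    # visit features in descending order of the parent's gain so the
    # bound prunes as early as possible
    order = (np.argsort(-np.asarray(parent_gains))
             if parent_gains is not None else np.arange(X.shape[1]))
    best_j, best_g, evaluated = -1, 0.0, 0
    for j in order:
        if parent_gains is not None and parent_gains[j] <= best_g:
            continue  # pruned by the parent-node bound
        evaluated += 1
        g = feature_gain(X[:, j], y, base)
        if g > best_g:
            best_j, best_g = j, g
    return best_j, best_g, evaluated
```

In a high-dimensional node, a single strong feature found early can prune most of the remaining features, which is consistent with the large reduction in computation load reported above.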
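The abstract also does not detail the combined horizontal-and-vertical division, so the sketch below shows one plausible layout consistent with it: the data matrix is cut into a √T × √T grid of blocks, one block per worker. The helper name `grid_partition` and the restriction to a perfect-square T are assumptions made for simplicity, not the thesis's actual scheme.

```python
import numpy as np

def grid_partition(X, T):
    """Split matrix X into a sqrt(T) x sqrt(T) grid of blocks,
    one per worker. Assumes T is a perfect square (illustrative
    sketch only)."""
    s = int(round(T ** 0.5))
    assert s * s == T, "this sketch assumes T is a perfect square"
    # cut the row index range and the column index range into s parts each
    row_blocks = np.array_split(np.arange(X.shape[0]), s)
    col_blocks = np.array_split(np.arange(X.shape[1]), s)
    blocks = {}
    for wi, rows in enumerate(row_blocks):
        for wj, cols in enumerate(col_blocks):
            # worker (wi, wj) holds only its own block of rows x columns
            blocks[(wi, wj)] = X[np.ix_(rows, cols)]
    return blocks
```

Under such a layout each worker stores roughly 1/T of the raw cells, and statistics for any one feature need only be exchanged among the √T workers sharing its column of the grid, which is one plausible reading of the O(√T) and O(1/√T) factors quoted above.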
Keywords/Search Tags:High-dimensional datasets, decision tree, fast induction, heuristic, parallel computing