
Fast Decision Tree Induction On High-dimensional Datasets

Posted on: 2018-12-19  Degree: Master  Type: Thesis
Country: China  Candidate: J L Liu  Full Text: PDF
GTID: 2348330512986745  Subject: Computer software and theory
Abstract/Summary:
Data mining is a popular approach to extracting knowledge and information from data, and the decision tree is one of the most popular data mining algorithms. However, existing decision tree algorithms are not efficient enough to handle high-dimensional data due to their heavy computation load, or are even infeasible because of their large RAM (Random Access Memory) requirements. This thesis designs fast decision tree induction algorithms for high-dimensional datasets.

Firstly, we propose IBH (impurity-based heuristic schema) to cut down the computation load of the decision tree induction procedure. IBH reuses computation results from a parent node to approximate upper bounds on the corresponding quantities in its child nodes, so that less computation is needed to find the best splits in the children. We justify the approach mathematically and conclude from extensive experiments that, in high-dimensional cases, IBH reduces around 70% of the computation load for both single decision trees and ensemble methods, with no negative impact on accuracy or concept conciseness.

Secondly, to optimize RAM usage and disk load, we propose a parallel induction algorithm that divides the dataset both horizontally and vertically. Compared with traditional horizontal-division or vertical-division methods, this approach cuts the cluster's RAM usage from O(T) to O(√T), where T is the number of parallel workers. At the same time, each worker's RAM usage drops from O(1) to O(1/√T): the bigger the cluster, the less RAM each worker needs. The approach has no negative impact on network throughput, disk load, or CPU utilization, and achieves good parallel efficiency on clusters of different scales.
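The abstract does not give the details of IBH, so the following is only a minimal sketch of the general idea it describes: per-feature impurity gains measured at the parent node are treated as approximate upper bounds on the gains in a child node, and features that cannot beat the running best are skipped. All names here (`best_split`, `parent_gains`, the Gini criterion) are illustrative assumptions, not the thesis's actual algorithm.

```python
import numpy as np

def gini(y):
    # Gini impurity of a 1-D label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p * p)

def feature_gain(x, y, base):
    # best impurity gain over all threshold splits on one feature
    best = 0.0
    for t in np.unique(x)[:-1]:
        mask = x <= t
        w = mask.mean()
        best = max(best, base - w * gini(y[mask]) - (1 - w) * gini(y[~mask]))
    return best

def best_split(X, y, parent_gains=None):
    """Find the feature with the largest impurity gain.

    If `parent_gains` (per-feature gains measured at the parent node)
    is given, each value is treated as an approximate upper bound on
    that feature's gain in this node -- the assumption of an IBH-style
    heuristic -- and features that cannot beat the running best gain
    are skipped without being evaluated.
    Returns (best_feature, best_gain, n_features_evaluated)."""
    base = gini(y)
    # visit features in descending order of the parent's gain so the
    # bound prunes as early as possible
    order = (np.argsort(-np.asarray(parent_gains))
             if parent_gains is not None else np.arange(X.shape[1]))
    best_j, best_g, evaluated = -1, 0.0, 0
    for j in order:
        if parent_gains is not None and parent_gains[j] <= best_g:
            continue  # pruned by the parent-node bound
        evaluated += 1
        g = feature_gain(X[:, j], y, base)
        if g > best_g:
            best_j, best_g = j, g
    return best_j, best_g, evaluated
```

In a high-dimensional node, a single strong feature found early can prune most of the remaining features, which is consistent with the large reduction in computation load reported above.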
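The abstract also does not detail the combined horizontal-and-vertical division, so the sketch below shows one plausible layout consistent with it: the data matrix is cut into a √T × √T grid of blocks, one block per worker. The helper name `grid_partition` and the restriction to a perfect-square T are assumptions made for simplicity, not the thesis's actual scheme.

```python
import numpy as np

def grid_partition(X, T):
    """Split matrix X into a sqrt(T) x sqrt(T) grid of blocks,
    one per worker. Assumes T is a perfect square (illustrative
    sketch only)."""
    s = int(round(T ** 0.5))
    assert s * s == T, "this sketch assumes T is a perfect square"
    # cut the row index range and the column index range into s parts each
    row_blocks = np.array_split(np.arange(X.shape[0]), s)
    col_blocks = np.array_split(np.arange(X.shape[1]), s)
    blocks = {}
    for wi, rows in enumerate(row_blocks):
        for wj, cols in enumerate(col_blocks):
            # worker (wi, wj) holds only its own block of rows x columns
            blocks[(wi, wj)] = X[np.ix_(rows, cols)]
    return blocks
```

Under such a layout each worker stores roughly 1/T of the raw cells, and statistics for any one feature need only be exchanged among the √T workers sharing its column of the grid, which is one plausible reading of the O(√T) and O(1/√T) factors quoted above.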
Keywords/Search Tags:High-dimensional datasets, decision tree, fast induction, heuristic, parallel computing