
Research And Implementation Of Distributed Decision Tree Algorithms In Classification Problems

Posted on: 2019-07-27    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y S Mu    Full Text: PDF
GTID: 1368330572953462    Subject: Control theory and control engineering
Abstract/Summary:
Classification is one of the important research tasks in many data mining areas such as pattern recognition, machine learning, image processing and information retrieval. Decision trees are among the most effective and widely used models because of their high accuracy, few parameters and good comprehensibility. They have been extensively applied, with significant results, in many real-world applications such as business, health care, manufacturing and production, financial analysis, remote sensing image classification and molecular biology. With the advance of science and technology and the development of the internet, the amount of data generated in many business fields is increasing extremely quickly, which presents both a new challenge and a new opportunity for decision trees in large-scale data classification. However, traditional decision trees cannot be applied directly to large-scale classification problems because of bottlenecks such as memory restrictions, time complexity and data complexity. To analyze and process large-scale data sets effectively, distributed algorithms have become increasingly important. This thesis investigates decision trees and their distributed implementations for classification problems; the research mainly covers the following topics.

(1) To meet the challenges that C4.5 decision trees face in large-scale classification problems, the thesis proposes a distributed C4.5 decision tree. The approach performs the computation for each tree node in a distributed way within the MapReduce framework, using two distributed procedures: one determines the best splitting attribute and the best splitting point, and the other partitions the data. First, the original data are randomly divided into a number of sub-data sets. Then, in the Map phase, the information gain of each attribute in each sub-data set is calculated to determine that attribute's best splitting point, and the information gain ratios of all attributes are computed at their best splitting points. Finally, in the Reduce phase, the information gain ratios delivered from the Map operations are summed per attribute, and the average point or the union set of the splitting points delivered from the Map operations is computed; the attribute with the largest sum of information gain ratios, together with its average splitting point or union of splitting points, is chosen to construct the tree node. To avoid over-partitioning, the depth of the tree, the number of samples and the maximal class probability in each tree node are used as termination conditions. In the experimental studies, the feasibility and effectiveness of the method are first verified by comparison with the classical C4.5 algorithm on 20 UCI data sets, and three evaluation indexes then demonstrate its good parallel performance.
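As a concrete illustration of the Reduce-phase aggregation described in topic (1), the following is a minimal Python sketch rather than the thesis code: it assumes each Map task emits, for every attribute, a local information gain ratio and a local best splitting point, and the Reduce step sums the ratios per attribute, averages the splitting points and selects the attribute with the largest summed ratio. The names `map_outputs` and `choose_split` are hypothetical.

```python
from collections import defaultdict

def choose_split(map_outputs):
    """Reduce-style aggregation sketch.
    map_outputs: iterable of dicts {attribute: (gain_ratio, split_point)},
    one dict per Map task (assumed format, for illustration only)."""
    ratio_sum = defaultdict(float)
    split_points = defaultdict(list)
    for local in map_outputs:
        for attr, (gain_ratio, split_point) in local.items():
            ratio_sum[attr] += gain_ratio          # sum of gain ratios per attribute
            split_points[attr].append(split_point)
    best_attr = max(ratio_sum, key=ratio_sum.get)  # largest summed gain ratio
    best_point = sum(split_points[best_attr]) / len(split_points[best_attr])  # average split point
    return best_attr, best_point

# Tiny usage example with two hypothetical Map outputs over attributes "a1" and "a2":
maps = [{"a1": (0.42, 3.0), "a2": (0.30, 7.5)},
        {"a1": (0.40, 3.4), "a2": (0.35, 7.1)}]
print(choose_split(maps))   # -> ('a1', 3.2)
```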
(2) During tree node splitting in classical rank mutual information based decision trees, all conditional attributes and attribute values have to be traversed to compute the rank mutual information, which is very time consuming. To address this problem and improve the ability to handle large-scale data sets, this study establishes a fast rank mutual information based decision tree and designs a distributed implementation. The method improves speed in two ways: the algorithm first generates candidate attributes with a max-relevance and min-redundancy criterion, removing redundant attributes before each tree node is built, and then employs the fuzzy c-means algorithm to determine the candidate splitting points (a small illustrative sketch of this step is given after topic (3) below). As a result, the computation of rank mutual information only traverses the candidate attributes and candidate splitting points. A parallel implementation is also developed in the MapReduce framework. Compared against traditional decision trees and traditional monotonic decision trees, the experimental results indicate the feasibility and effectiveness of the method, and the experiments on large-scale data sets show that the parallel implementation is feasible and achieves good parallel performance.

(3) To broaden the applications of Pearson's correlation coefficient and enrich research on decision trees, a new Pearson's correlation coefficient based decision tree is proposed and its parallel implementation is developed. In the proposed method, a new impurity measure is established from Pearson's correlation coefficient through a mapping between the conditional attributes and the decision attribute, and this measure is used to design a new splitting rule that determines the optimal splitting attributes and splitting points during tree growth. To improve the ability to cope with large-scale data classification, a parallel implementation is again developed in the MapReduce framework. On several evaluation indexes, the experimental results show that the proposed method is competitive with existing methods, and the studies on large-scale data sets verify its feasibility and good parallel performance, especially in reducing computational time and avoiding memory restrictions.
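For topic (2), the abstract names fuzzy c-means as the tool for generating candidate splitting points. The Python sketch below is an illustration rather than the thesis implementation: it runs one-dimensional fuzzy c-means on a numerical attribute and returns the sorted cluster centres as candidate splitting points. Whether the centres themselves or points derived from them are used, and how the max-relevance min-redundancy attribute filter is computed, is not stated in the abstract, so those details are assumptions; the name `fcm_candidate_splits` and its parameters are illustrative.

```python
import numpy as np

def fcm_candidate_splits(values, c=3, m=2.0, iters=50, seed=0):
    """One-dimensional fuzzy c-means; the sorted centres are treated here
    as candidate splitting points for one numerical attribute (assumption)."""
    x = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    u = rng.random((c, x.size))
    u /= u.sum(axis=0)                          # fuzzy memberships, columns sum to 1
    for _ in range(iters):
        um = u ** m
        centers = um @ x / um.sum(axis=1)       # weighted cluster centres
        d = np.abs(x[None, :] - centers[:, None]) + 1e-12
        u = d ** (-2.0 / (m - 1.0))             # standard FCM membership update
        u /= u.sum(axis=0)
    return np.sort(centers)

# Tiny usage example with hypothetical attribute values:
print(fcm_candidate_splits([1.0, 1.2, 0.9, 5.0, 5.3, 9.8, 10.1], c=3))
```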
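For topic (3), the impurity measure built from Pearson's correlation coefficient is defined in the thesis itself; the short sketch below only illustrates the general idea of scoring a candidate split with the coefficient, here as the correlation between a split indicator and a binary class label. The exact mapping between conditional attribute and decision attribute used in the thesis may differ, and `pearson_split_score` is a hypothetical name.

```python
import numpy as np

def pearson_split_score(x, y, threshold):
    """Score the candidate split x <= threshold by the absolute Pearson
    correlation between the split indicator and a binary class label
    (illustrative construction only)."""
    left = (np.asarray(x) <= threshold).astype(float)
    y = np.asarray(y, dtype=float)
    if left.std() == 0 or y.std() == 0:
        return 0.0                              # degenerate split or single class
    return abs(np.corrcoef(left, y)[0, 1])

# Tiny usage example with hypothetical data:
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(pearson_split_score(x, y, 3.0))           # -> 1.0, a perfect split
```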
Keywords/Search Tags: Decision Trees, Distributed Computation, Classifier Design, Classification Problems