
Research On Multiple Decision Tree And Its Distribution Computing Theory Over Unbalanced Big Data

Posted on: 2018-03-22  Degree: Master  Type: Thesis
Country: China  Candidate: X Q Zhang  Full Text: PDF
GTID: 2348330536466296  Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid development of computers and information technology, the data generated by various industries have grown into big data and show imbalanced class distributions. As an important data mining technique, classification can be used to predict how data will develop or to uncover their potential value. Traditional classification algorithms mainly optimize overall accuracy and ignore other aspects. On imbalanced data sets this easily leads to poor classification performance and loses important information carried by the minority class. Moreover, on big data sets the traditional algorithms either cannot construct the prediction model at all or run inefficiently, because of the memory and storage limits of ordinary computers. To solve these problems, this paper focuses on decision tree algorithms and proposes a multi-decision tree algorithm based on a cost-sensitive hybrid attribute selection strategy, together with a novel distributed computing method for imbalanced big data sets. The main contributions are as follows:
1. First, a cost-sensitive hybrid attribute selection strategy is proposed after a detailed analysis of several traditional decision tree algorithms. The strategy linearly combines the attribute selection measure of the C4.5 algorithm with the Gini index of the CART algorithm. For imbalanced data sets, a cost-sensitive method is used to improve classification performance on the minority class. The experimental results showed that the proposed attribute selection measure can greatly improve minority-class accuracy while preserving classification performance on the majority class.
2. Second, an improved random forest multi-decision tree algorithm that uses full attribute information is proposed. To improve the classification accuracy of the decision tree algorithm and to account for the influence of root node information on the tree, this paper builds on Random Forests (RF) and improves both the minority-class learning method and the attribute selection measure. In the traditional RF method, the training data and the attributes are selected at random. The experimental results showed that the proposed multi-decision tree algorithm achieves high classification accuracy and ensures good overall performance of the multi-decision tree model.
3. Third, a distributed storage and computing platform is designed and implemented. For imbalanced big data sets, this paper builds a Hadoop distributed storage and computing platform on the existing hardware. The platform provides high reliability, high storage capacity, and efficient distributed computing. The platform parameters are set and tuned for the distributed multi-decision tree algorithm proposed in this paper so that the platform performs as well as possible.
4. Last but not least, a novel distributed computing model for the multi-decision tree algorithm is proposed. A study of the relationship between algorithm accuracy, execution time, and training sample size leads to the conclusion that a suitable training sample size can be set for each data set, and that this sample size is sufficient to ensure good accuracy. Based on this conclusion, this paper proposes a distributed multi-decision tree computing model that combines the coarse-grained computation of MapReduce with the fine-grained computation of threads. The experimental results showed that the algorithm achieves good speedup and scalability.
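The hybrid attribute selection measure in contribution 1 can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the mixing weight `alpha`, the per-class `cost` dictionary used to re-weight samples, the restriction to categorical attributes, and all function names. It should not be read as the thesis's actual method.

```python
from collections import Counter, defaultdict
from math import log2


def weighted_dist(labels, cost):
    """Return (class probabilities, total weight); each sample counts cost[label] (assumed scheme)."""
    totals = Counter()
    for y in labels:
        totals[y] += cost.get(y, 1.0)
    total = sum(totals.values())
    return [w / total for w in totals.values()], total


def entropy(p):
    return -sum(q * log2(q) for q in p if q > 0)


def gini(p):
    return 1.0 - sum(q * q for q in p)


def hybrid_score(values, labels, cost, alpha=0.5):
    """alpha * gain ratio (C4.5 style) + (1 - alpha) * Gini reduction (CART style).

    The linear form and alpha are hypothetical; the source only says the two
    measures are combined linearly.
    """
    parent_p, total = weighted_dist(labels, cost)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)

    child_entropy = child_gini = split_info = 0.0
    for ys in groups.values():
        p, weight = weighted_dist(ys, cost)
        frac = weight / total
        child_entropy += frac * entropy(p)
        child_gini += frac * gini(p)
        split_info -= frac * log2(frac)

    info_gain = entropy(parent_p) - child_entropy
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    gini_gain = gini(parent_p) - child_gini
    return alpha * gain_ratio + (1 - alpha) * gini_gain


# Toy imbalanced data: 8 majority ("neg") and 2 minority ("pos") samples.
labels = ["neg"] * 8 + ["pos"] * 2
cost = {"neg": 1.0, "pos": 4.0}          # hypothetical misclassification costs
separating = ["x"] * 8 + ["y"] * 2       # attribute that splits the classes perfectly
constant = ["z"] * 10                    # attribute that carries no information

best = max([("separating", separating), ("constant", constant)],
           key=lambda item: hybrid_score(item[1], labels, cost))
print(best[0])
```

With these costs the minority class carries the same total weight as the majority class, so a split that isolates the two minority samples scores highly even though it affects only 20% of the rows.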
Keywords/Search Tags: Imbalanced big data, classification, cost-sensitive, attribute selection hybrid measure, multi-decision tree prediction model, distributed computing model