The Analysis And Improvement Of Decision Tree Algorithm On Information Gain

Posted on: 2016-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Qi
GTID: 2348330479454403
Subject: Probability theory and mathematical statistics
Abstract/Summary:
The concept of data mining first appeared at the knowledge discovery meeting in 1995, where it was put forward by Fayyad. He held that data mining is a process of knowledge discovery: an automatic or semi-automatic process for finding potentially useful patterns in large amounts of data. In its early days, data mining research was hampered by the difficulty of collecting data, and working with small data sets easily led to overfitted models. As more and more people realized the importance of data mining, businesses began to invest substantial resources in building and maintaining their own information systems to collect the available data. This abundance of data brought problems as well as opportunities. First, the volume of data is very large but not all of it is useful, so finding the data we need is an important issue. Second, storing such large data sets is itself a major challenge.

The methods used in data mining fall mainly into the following categories. Supervised predictive models: neural networks, decision trees, regression and other methods. Unsupervised models: cluster analysis (fast clustering and second-order clustering) and correlation analysis (multidimensional correlation and temporal correlation). Dimensionality reduction for big data: principal component analysis, factor analysis, etc. The appropriate method of analysis depends on the desired result.

This thesis mainly introduces the ID3 and C4.5 decision tree algorithms, which are used to construct supervised predictive models, and analyzes the advantages and disadvantages of the two algorithms. Its main contribution is a corrected information gain function built on the ID3 algorithm. To a certain extent, the corrected algorithm avoids ID3's tendency to choose attributes with more distinct values as the splitting attribute.
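The bias mentioned above can be illustrated with a small sketch. The abstract does not give the thesis's corrected gain function, so the code below does not attempt to reproduce it. It only computes the standard textbook ID3 information gain and, for contrast, the C4.5 gain ratio. The toy data set and the attribute names sample_id and weather are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def partition(values, labels):
    """Group the class labels by the attribute value of each sample."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(values, labels):
    """ID3 criterion: entropy before the split minus weighted entropy after."""
    total = len(labels)
    remainder = sum(len(group) / total * entropy(group)
                    for group in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """C4.5 criterion: information gain divided by the split information."""
    split_info = entropy(values)  # entropy of the attribute values themselves
    if split_info == 0:
        return 0.0
    return information_gain(values, labels) / split_info


# Hypothetical toy data: 8 samples and two candidate attributes.
labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
sample_id = [1, 2, 3, 4, 5, 6, 7, 8]  # a different value for every sample
weather = ["sun", "sun", "rain", "rain", "sun", "rain", "sun", "sun"]

for name, attribute in [("sample_id", sample_id), ("weather", weather)]:
    print(name,
          "gain =", round(information_gain(attribute, labels), 3),
          "gain ratio =", round(gain_ratio(attribute, labels), 3))
```

In this toy example, plain information gain ranks sample_id highest even though an attribute that is unique per sample cannot generalize, while the gain ratio ranks weather higher. This is the kind of preference for many-valued attributes that a corrected gain function is meant to reduce.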
An experimental comparison of the predictive ability of the ID3 algorithm and the improved algorithm shows that the improved algorithm has higher prediction accuracy.

This thesis also analyzes several algorithms for data stream mining: the VFDT algorithm, which is based on the Hoeffding inequality; the NIPDT algorithm, which handles continuous attributes; the VFDTb algorithm, which is based on a sorted binary tree; and the VFDTc algorithm, which incorporates the idea of Bayesian classification. It compares the data processing speed of the VFDT and VFDTc algorithms, and the experimental results show that VFDTc processes data faster.
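As a rough illustration of how the Hoeffding inequality is used in VFDT-style trees, the sketch below computes the Hoeffding bound and applies the usual split test: a leaf is split once the observed difference between the two best attributes exceeds the bound. It is a minimal sketch under stated assumptions, not the thesis's implementation. The function names, the use of log2 of the number of classes as the range of information gain, and the example numbers are assumptions, and the tie-breaking threshold used in full VFDT is omitted.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with the given range lies within this distance of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, delta, n):
    """Split when the gap between the two best attributes exceeds the bound.

    Assumption: the heuristic is information gain, whose range is taken
    to be log2(n_classes).
    """
    value_range = math.log2(n_classes)
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical numbers: the same observed gap of 0.10 between the two best
# attributes, checked after 200 and after 5000 samples at one leaf.
for n in (200, 5000):
    print(n, should_split(0.30, 0.20, n_classes=2, delta=1e-7, n=n))
```

Because the bound shrinks as more samples arrive, a gap that is too small to justify a split early in the stream can become sufficient later, which lets the tree decide from a bounded number of samples without storing the whole stream.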
Keywords/Search Tags: Data mining, Decision tree, ID3, Data stream mining