The Analysis And Improvement Of Decision Tree Algorithm On Information Gain

Posted on: 2016-01-27

Degree: Master

Type: Thesis

Country: China

Candidate: M Y Qi

GTID: 2348330479454403

Subject: Probability theory and mathematical statistics

Abstract/Summary:

The concept of data mining first appeared at the knowledge discovery conference in 1995, where it was put forward by Fayyad. He regarded data mining as a process of knowledge discovery: an automatic or semi-automatic process for finding potentially useful data models in large amounts of data. Early data mining research was hampered by the difficulty of collecting data, and working with only a small amount of data easily leads to overfitted models. As more people came to realize the importance of data mining, businesses began to invest substantial resources in building and maintaining their own information systems to collect the available data. This abundance of data brought problems as well as benefits. First, the data volume is very large but not all of it is useful, so finding the data we need within it is an important issue. Second, storing such large volumes of data is itself a major challenge.

The methods used in data mining fall mainly into the following categories. Supervised prediction models: neural networks, decision trees, regression, and other methods. Unsupervised models: cluster analysis (fast clustering and second-order clustering) and correlation analysis (multidimensional correlation and temporal correlation). Dimensionality reduction for big data: principal component analysis, factor analysis, etc. Different analysis methods are chosen depending on the desired result. This thesis mainly introduces the ID3 and C4.5 algorithms, two decision tree methods used to construct supervised prediction models, and analyzes the advantages and disadvantages of both. Its main contribution is a corrected information gain function based on the ID3 algorithm. The corrected algorithm reduces, to a certain extent, the tendency to choose attributes with more distinct values as the splitting attribute. A comparison of the predictive ability of ID3 and the improved algorithm shows experimentally that the improved algorithm has higher prediction accuracy.

This thesis also analyzes several algorithms for data stream mining: the VFDT algorithm, based on the Hoeffding inequality; the NIPDT algorithm, for handling continuous attributes; the VFDTb algorithm, based on a sorted binary tree; and the VFDTc algorithm, which incorporates ideas from Bayesian classification. It compares the data processing speed of VFDT and VFDTc, and the experimental results show that VFDTc processes data faster.
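
The bias of ID3 toward attributes with many values can be illustrated with a minimal Python sketch. The abstract does not give the thesis's corrected information gain function, so the code below is not that function: it shows plain information gain as used by ID3 next to the gain ratio used by C4.5, a well-known correction for the same bias. The toy data and the attribute names row_id and weather are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def partition(values, labels):
    """Group the class labels by the attribute value of each sample."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    return groups


def information_gain(values, labels):
    """ID3 criterion: label entropy minus weighted entropy after the split."""
    total = len(labels)
    remainder = sum(len(group) / total * entropy(group)
                    for group in partition(values, labels).values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """C4.5 criterion: information gain divided by the split information."""
    split_info = entropy(values)  # entropy of the attribute's own values
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split
    return information_gain(values, labels) / split_info


# Hypothetical toy data: 8 samples, 5 "yes" and 3 "no".
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "yes"]
# row_id is unique per sample: many values, no real predictive meaning.
row_id = list(range(8))
# weather has two values and is genuinely related to the label.
weather = ["sunny"] * 4 + ["rainy"] * 4

if __name__ == "__main__":
    for name, attr in (("row_id", row_id), ("weather", weather)):
        print(name,
              round(information_gain(attr, labels), 4),
              round(gain_ratio(attr, labels), 4))
```

On this toy data, plain information gain ranks row_id above weather because every one-sample subset is trivially pure, while the gain ratio ranks weather first. Any corrected gain function aimed at this bias is trying to achieve the same kind of reordering.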
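
The role of the Hoeffding inequality in VFDT can be sketched in a few lines of Python. This is an illustration of the standard Hoeffding bound, not the thesis's implementation: after n samples, the observed mean of a quantity with range R is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of its true mean with probability 1 - delta, and a leaf is split once the gap between the two best attributes exceeds epsilon. The function names, the choice of delta, and the gain values are assumptions made for the example, and VFDT's tie-breaking threshold is omitted.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the
    observed mean with probability 1 - delta after n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, delta, n):
    """Split when the best attribute beats the runner-up by more than
    the Hoeffding bound (hypothetical helper, no tie-breaking)."""
    # Information gain measured in bits ranges over [0, log2(n_classes)].
    value_range = math.log2(n_classes)
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Assumed gains for the two best attributes at one leaf.
    for n in (200, 5000):
        print(n,
              round(hoeffding_bound(1.0, 1e-7, n), 4),
              should_split(0.30, 0.20, n_classes=2, delta=1e-7, n=n))
```

With an observed gap of 0.10 between the two best attributes, the leaf is not split after 200 samples but is split after 5000, because the bound shrinks as more of the stream has been seen. This is what lets a stream learner commit to a split without storing the data.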

Keywords/Search Tags:

Data mining, Decision tree, ID3, Data streams mining

Related items

1	Freight Invoice Based On Decision Tree Data Mining System
2	The Research Of Decision Tree Algorithm In Data Mining
3	Research On The Mining Algorithm Based On Data Streams
4	The Research Of Data Mining In Mobile Communication Enterprise Based On Decision Tree
5	Exploration And Research Of ODM Data Mining To Forest Management In Tahe
6	A Study Of Optimizing Data Mining Algorithms Based On Decision Tree
7	Research And Application On The Decision Tree Classification Algorithm Of Data Mining
8	Research And Implementation Of Data Mining Algorithms Based On SSAS
9	Analytical Study On Student Sources Of Vocational Colleges Based On Data Mining
10	Research And Application On The Data Mining Algorithm Based On Decision Tree