Improved Decision Tree Algorithm With Abilities Of Dimension Reduction And Noise-free

Posted on: 2016-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: W Wang
Full Text: PDF
GTID: 2308330461951372
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of information technology, high-dimensional data of many kinds have appeared in the social and natural sciences. On the one hand, such data carry more usable information; on the other hand, processing and analyzing them poses a great challenge. In particular, as data mining techniques are applied more widely, the sensitivity of classification results to noise has become a problem that cannot be ignored, and existing classification techniques urgently need further optimization. To improve the ability of decision tree classification to predict high-dimensional data in high-noise environments, this paper takes the C4.5 decision tree algorithm as the object of optimization and studies it in depth. Using the idea of the Noise-free Principal Component Analysis (NFPCA) algorithm to improve the traditional C4.5 algorithm, this paper proposes the NFPCA-in-C4.5 algorithm to address the decline in tree prediction accuracy caused by high noise in high-dimensional data.
The main work includes:

(1) This paper analyzes in detail how the PCA algorithm reduces dimensionality, and why, for noisy high-dimensional data, the result in the principal component space is still polluted by noise. Considering how high dimensionality and high noise affect the predictive performance of a decision tree classification model, it uses the idea of the NFPCA algorithm to recast noise control for high-dimensional data as a combined optimization problem that balances fitting the data against controlling the smoothness of the features. This optimization problem fits the definition of a regularized least squares problem, and the principal component space obtained by solving it is relatively noise-free, so the method not only reduces the dimensionality but also weakens the influence of the noise.

(2) The decision tree model is constructed top-down by recursively building tree nodes. First, when constructing a parent node, the NFPCA algorithm converts the original data space into the principal component space. Next, the split attribute is chosen from the principal-component data set using an information-entropy-based method, and the data set is divided accordingly. Finally, when constructing a child node, the data subsets are converted back to the original data space. By moving between the original data space and the principal component space from parent node to child nodes, the algorithm avoids losing information during dimensionality reduction, which reduces the impact of information loss on the prediction accuracy of C4.5.

Comparison experiments between C4.5 and NFPCA-in-C4.5 examine the changes in accuracy and in the size of the prediction model to demonstrate the advantages of the NFPCA-in-C4.5 algorithm.
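The node-construction procedure in step (2) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: plain PCA (computed with an SVD) stands in for NFPCA, whose regularized least squares step is not specified in this abstract; plain information gain stands in for the full C4.5 split criterion; and the names `pca_project`, `best_split`, `build`, `predict`, and the parameters `k` and `max_depth` are all hypothetical.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def pca_project(X, k):
    """Plain PCA via SVD. ASSUMPTION: stand-in for NFPCA, which is not shown here."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    comps = vt[:k]                        # top-k principal directions
    return (X - mean) @ comps.T, mean, comps

def best_split(Z, y):
    """Pick the (component, threshold) pair with the highest information gain."""
    base, best = entropy(y), (0.0, None, None)
    for j in range(Z.shape[1]):
        vals = np.unique(Z[:, j])
        for t in (vals[:-1] + vals[1:]) / 2:   # midpoints between sorted values
            left = Z[:, j] <= t
            w = left.mean()
            gain = base - w * entropy(y[left]) - (1 - w) * entropy(y[~left])
            if gain > best[0]:
                best = (gain, j, t)
    return best

def build(X, y, k=2, depth=0, max_depth=5):
    """Recursively build a tree; each node projects its own data subset."""
    labels, counts = np.unique(y, return_counts=True)
    leaf = {"label": labels[counts.argmax()]}
    if len(labels) == 1 or depth == max_depth or len(y) < 2:
        return leaf
    # Parent node: move to the principal component space and split there.
    Z, mean, comps = pca_project(X, min(k, X.shape[1]))
    _, j, t = best_split(Z, y)
    if j is None:
        return leaf
    left = Z[:, j] <= t
    # Child nodes: recurse on the subsets in the ORIGINAL data space.
    return {"mean": mean, "axis": comps[j], "thr": t,
            "left": build(X[left], y[left], k, depth + 1, max_depth),
            "right": build(X[~left], y[~left], k, depth + 1, max_depth)}

def predict(node, x):
    """Route one sample down the tree using each node's stored projection."""
    while "label" not in node:
        go_left = (x - node["mean"]) @ node["axis"] <= node["thr"]
        node = node["left"] if go_left else node["right"]
    return node["label"]
```

Because each child receives its rows in the original space and recomputes its own projection, no feature is discarded permanently at the root, which is the point the abstract makes about avoiding information loss.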
Experimental results show that the proposed NFPCA-in-C4.5 algorithm exploits the high-noise characteristics of high-dimensional data, makes full use of the relationship between the parent node's data set and the child nodes' data sets, and integrates the noise reduction of NFPCA into the construction process of the C4.5 algorithm. Dimensionality reduction and noise reduction are achieved as nodes are continually built, rather than being applied only as a preprocessing step as in traditional noise reduction. As a result, NFPCA-in-C4.5 has both dimensionality reduction and noise tolerance capabilities, improves the robustness of the decision tree algorithm, and significantly reduces the loss of prediction accuracy caused by feature information lost during dimensionality reduction and by residual noise. For high-dimensional data in high-noise environments, the algorithm keeps the model simple, with stable predictions and a stable model size.
Keywords/Search Tags: High-dimensional data, Noise-free, Principal component analysis, Decision tree algorithm