Improvement Of Decision Tree Algorithm Based On Hadoop And Research On Classification And Prediction Of Forestry Data

Posted on: 2017-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: H T Li
Full Text: PDF
GTID: 2308330491954677
Subject: Computer system architecture

Abstract/Summary:
With the rapid development of the Internet, the scale of data has grown exponentially, and these vast amounts of data contain a great deal of information whose value can only be extracted through analysis. Traditional memory-resident data mining algorithms are limited by single-machine performance when dealing with massive data, whereas Hadoop, with its distributed storage system and parallel programming framework, provides an effective solution for processing huge amounts of data.

Forest resources in China are relatively rich. After many years of monitoring and sorting, the basic data on forest resources has begun to take shape, and it is high-dimensional, noisy, and large in scale. Traditional methods are increasingly inadequate for analyzing forestry data and can no longer meet the needs of forestry, so a scientific and efficient technology is urgently required.

Based on the above, an imprecise probability C4.5 (IP-C4.5) algorithm based on Hadoop is proposed. The improved algorithm reduces the influence of unreliable data and is able to handle massive data. The improved C4.5 algorithm is also applied to the classification and prediction of forestry data, specifically forest maturity and forest cover type, which may open up a new pattern for forestry data analysis in the future.

The specific research contents of this paper are as follows:

(1) The C4.5 decision tree algorithm was chosen for study and improvement, and the J48 code in the open source software Weka was used as the basis for the research. When the improved C4.5 algorithm selects the split attribute, a split criterion based on the imprecise probability information gain ratio is used instead of the original criterion, which makes it more suitable for noisy data sets.

(2) Cloud computing technology was studied, mainly the HDFS file system and the MapReduce parallel programming framework of Hadoop. A parallel design based on file splits was then proposed for the improved algorithm: the attribute selection criterion is calculated by splitting the data lengthways, combined with the traditional decision tree algorithm model. Without sacrificing classification accuracy, the Hadoop-based parallelization of the improved algorithm is efficient and scalable when dealing with massive data.

(3) Forestry data is high-dimensional, noisy, and massive. Because the improved C4.5 algorithm is suitable for noisy data and the parallel C4.5 algorithm based on Hadoop is suitable for massive data, the final program was applied to forestry data. Open and closed tests were carried out on a forest sub-compartment data set to predict forest maturity, and the large forest cover type data set Covertype from the UCI Machine Learning Repository was used to build a decision tree model for predicting the type of forest cover.

Finally, the experimental results show that the improved algorithm has higher accuracy when dealing with noisy data. The parallelized improved algorithm loses no classification accuracy and has a clear advantage in dealing with massive data, such as ideal speedup and efficiency. For the classification and prediction of forestry data, the improved algorithm has higher accuracy and a better time advantage.
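The parallel calculation of the attribute selection criterion described in the abstract can be illustrated with a small sketch. It is not the thesis implementation: it simulates the map and reduce steps in plain Python rather than on Hadoop, it assumes that each file split holds whole records, and it uses the standard C4.5 gain ratio as a stand-in because the abstract does not give the imprecise probability formula. The function names (`map_split`, `reduce_counts`, `gain_ratio`) and the toy forest records are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import log2


def map_split(rows, n_attrs):
    """Map step: count (attribute index, value, class label) triples in one split."""
    counts = Counter()
    for row in rows:
        label = row[-1]  # assumption: the class label is the last field
        for a in range(n_attrs):
            counts[(a, row[a], label)] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: sum the per-split counters into global counts."""
    return reduce(lambda x, y: x + y, partials, Counter())


def entropy(freqs):
    """Shannon entropy (base 2) of a collection of non-negative counts."""
    freqs = list(freqs)
    total = sum(freqs)
    return -sum(f / total * log2(f / total) for f in freqs if f > 0)


def gain_ratio(counts, attr):
    """Standard C4.5 gain ratio for one attribute, using merged counts only."""
    by_class = Counter()
    by_value = {}
    for (a, value, label), n in counts.items():
        if a != attr:
            continue
        by_class[label] += n
        by_value.setdefault(value, Counter())[label] += n
    total = sum(by_class.values())
    sizes = [sum(c.values()) for c in by_value.values()]
    conditional = sum(
        size / total * entropy(c.values())
        for size, c in zip(sizes, by_value.values())
    )
    gain = entropy(by_class.values()) - conditional
    split_info = entropy(sizes)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records: (species, age class, maturity label), in two splits.
splits = [
    [("pine", "old", "mature"), ("pine", "young", "immature")],
    [("oak", "old", "mature"), ("oak", "young", "immature")],
]
counts = reduce_counts(map_split(s, n_attrs=2) for s in splits)
best = max(range(2), key=lambda a: gain_ratio(counts, a))
print("best split attribute index:", best)
```

Because the criterion depends only on the summed counts, the merged result is the same as a single pass over all records, which is consistent with the abstract's claim that parallelization costs no classification accuracy. An imprecise probability variant would replace the `entropy` function while leaving the map and reduce steps unchanged.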
Keywords/Search Tags: Hadoop, decision tree, uncertain probability, noisy data, forestry data