Research On Decision Tree Classification Algorithm Based On Meta-learning

Posted on: 2010-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: L Yu
Full Text: PDF
GTID: 2178360272997168
Subject: Computer application technology
Abstract/Summary:
Today the world is in a time of "data explosion but information poverty." Driven by commercial interests, data mining came into being. In a broad sense, data mining refers to the non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable patterns from massive amounts of data. It is a cross-disciplinary field involving machine learning, mathematical statistics, database technology, visualization, pattern recognition, high-performance computing, and many other areas. Classification is a predictive modeling technique and an important aspect of data mining, widely used in many fields. Mining classification rules is an important form of data analysis: it can be used to extract and describe models of data classes and to predict future data trends. Classification research has developed rapidly abroad and has produced many algorithms and models, but its development lags behind in China, so research on data classification is of great significance for data mining. Studies show that in today's era of ever-expanding data volumes, algorithm characteristics such as execution speed, scalability, and understandable output are extremely important. Among classification algorithms, the decision tree is the most widely used in large-data environments because it is fast, simple, and convenient. The decision tree method is a widely and deeply studied classification function approximation method and a common prediction model; its purpose is to classify a large amount of data in order to discover potentially valuable information. The key to constructing a decision tree that is as small as possible is adopting a heuristic strategy to determine the best split attribute according to a segmentation criterion.
Attribute selection is based on a heuristic rule or a statistical measure; the most common segmentation criteria are information gain and the Gini index. SPRINT is one of the outstanding decision tree algorithms. It has many advantages over other data mining algorithms: it is not limited by available memory, it parallelizes easily, it generates compact and accurate decision trees, and it offers good scalability and speedup. The basic strategy of the algorithm is a greedy approach that generates the decision tree recursively, top-down, level by level, until the data can no longer be split. This thesis first studies the SPRINT algorithm and implements it in C++. It then makes a further improvement to SPRINT: by introducing a dynamic data structure, the improved algorithm reduces both the storage space occupied by the attribute lists and the time required for node-splitting operations. Because SPRINT uses attribute lists, it has three major disadvantages: a threefold storage cost, the need to create a hash table for each node partition, and an increased burden on the system. To address these shortcomings, this thesis presents an improved method that replaces the attribute lists with a new combined structure of the form <continuous attribute, a number of discrete attributes, class label, record index>. The benefits are: 1. the number of new attribute lists depends only on the number of continuous attributes, which saves disk space; 2. attributes in the same attribute list no longer have to be split according to a hash table index.
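To make the segmentation criteria concrete, the following sketch evaluates candidate binary splits of a sorted continuous-attribute list using the Gini index, in the spirit of SPRINT. This is a simplified, hypothetical illustration in Python, not the thesis's C++ implementation; the toy `age` data and function names are assumptions for the example.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy, used by the information gain criterion."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values() if c)

def best_split(attribute_list):
    """attribute_list: (value, class) pairs sorted by value, as in a
    SPRINT attribute list. Returns (weighted Gini, split point) for
    the best midpoint between consecutive distinct values."""
    values = [v for v, _ in attribute_list]
    labels = [c for _, c in attribute_list]
    n = len(attribute_list)
    best = (float("inf"), None)
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # no valid split between equal values
        left, right = labels[:i], labels[i:]
        weighted = (i / n) * gini(left) + ((n - i) / n) * gini(right)
        split_point = (values[i - 1] + values[i]) / 2
        best = min(best, (weighted, split_point))
    return best

# Toy attribute list: (age, class), already sorted by age.
data = [(23, "no"), (25, "no"), (30, "yes"), (35, "yes"), (40, "yes")]
score, point = best_split(data)
print(point)  # 27.5: perfectly separates the two classes
print(score)  # 0.0

# Information gain of the same split: H(parent) - weighted H(children).
parent = [c for _, c in data]
gain = entropy(parent) - (2 / 5) * entropy(parent[:2]) - (3 / 5) * entropy(parent[2:])
print(round(gain, 3))  # 0.971
```

Note that SPRINT keeps each attribute list pre-sorted once, so every candidate split point at a node can be scanned in a single pass, which is what makes the algorithm attractive for data that does not fit in memory.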
Test results show that the improved algorithm speeds up tree generation to a certain extent and saves some storage space. With the rapid development of related disciplines, the acceleration of the modern information society, the rapid growth in database use, and especially the widespread use of the Internet, distributed data mining has come into being. It is an integration of data mining technology and distributed computing. It adopts the ideas and techniques of traditional centralized mining algorithms, but, taking the distributed environment into account, the algorithms must have good parallelism and scalability; it is mainly used to discover data models in distributed environments. Applying decision tree classification algorithms to distributed data mining benefits the broader application of data mining technology in the current environment. This thesis presents the working process of the improved algorithm in a distributed environment: it explores how the improved SPRINT algorithm works in that setting and extends the improved algorithm, with its strong scalability, to the distributed environment. The processors form a hierarchy in which each processor exchanges data only with processors at the same level or at the adjacent upper or lower level, while all other processors remain transparent to it. To mine distributed data on homogeneous nodes, researchers have proposed meta-learning, cooperative learning, and other methods, among which meta-learning is the most representative. Meta-learning adopts an ensemble learning approach to generate the final global prediction, that is, a meta-classifier. The basic idea is to learn further from the knowledge gained locally in order to obtain the final data model.
Finally, this thesis explores the improved SPRINT algorithm based on meta-learning in a homogeneous distributed environment and implements it using multi-thread simulation. The meta-learning method lets the sites cooperate to obtain the final decision tree classification rules. The main idea is: on any number of distributed sites (in theory), use the SPRINT algorithm to extract models or sub-models from homogeneous records; each sub-model corresponds to one or several partial results, which serve as the meta-training sets; then use the meta-learning strategy to obtain the best overall decision tree model. 1. In the meta-knowledge discovery stage, sample sets are selected randomly from the local database and each sample set is randomly assigned to a site; every site applies the improved SPRINT algorithm to obtain its partial results. 2. The results of the first stage are then integrated through the meta-learning mechanism into a meaningful global result. Tests show that the results of the meta-learning-based implementation are roughly the same as those obtained in the centralized setting; it achieves a certain degree of accuracy, speeds up tree construction to some extent, and improves the utilization of the distributed environment. However, as the number of threads increases, the running time increases slightly. The reason is that the experiment simulates multi-threading on a single-core CPU: data exchange between the threads takes time, and the implementation is only nominally parallel. The experimental data sets in this thesis are taken from the well-known UCI machine learning repository, an evaluation benchmark widely used in the classification field. The experiments further show that the SPRINT algorithm, the improved SPRINT algorithm, and the homogeneous distributed improved SPRINT algorithm based on meta-learning are feasible and effective to some extent.
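The two-stage meta-learning process described above can be sketched as follows. This is a minimal illustration under stated assumptions: each site's "classifier" is a one-level decision stump standing in for a full improved-SPRINT tree, and majority voting stands in for the thesis's meta-classifier; the data, function names, and combination strategy are assumptions of the example, not the thesis's actual implementation.

```python
import random
from collections import Counter

def train_site_classifier(samples):
    """Stage 1 stand-in: one site's local model. Here it is a decision
    stump on feature 0 (the thesis trains a full SPRINT tree per site)."""
    best = None
    for x, _ in samples:
        t = x[0]  # candidate threshold taken from the site's own data
        err = sum((1 if xi[0] >= t else 0) != y for xi, y in samples)
        if best is None or err < best[0]:
            best = (err, t)
    t = best[1]
    return lambda v: 1 if v[0] >= t else 0

def meta_classify(classifiers, x):
    """Stage 2: combine the sites' partial results. Majority voting is
    used here as a simple stand-in for the meta-level learner."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Simulate a homogeneous distributed setting: one global data set,
# randomly partitioned across three sites.
random.seed(0)
data = [([v], int(v >= 50)) for v in range(100)]
random.shuffle(data)
sites = [data[i::3] for i in range(3)]

classifiers = [train_site_classifier(s) for s in sites]
print(meta_classify(classifiers, [80]))  # 1
print(meta_classify(classifiers, [10]))  # 0
```

Because the sites hold homogeneous (identically structured) records, each local model sees the same feature space, which is what makes the simple vote at the meta level meaningful.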
Keywords/Search Tags:data mining, classification, decision tree, SPRINT algorithm, meta-learning