
Research and Implementation of a Large Data Set Mining Algorithm Based on Rough Set

Posted on: 2011-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhou
Full Text: PDF
GTID: 2178360302993842
Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer technology, sensor technology, and the Internet, effective tools now exist for generating, transmitting, storing, and retrieving data. As the rate and volume of captured data grow, data streams of every kind are recorded on various storage media. The rapid growth in the number of instances, attributes, and classes produces high-dimensional data sets, which pose serious challenges to the robustness and scalability of many machine learning algorithms, including decision tree classification.

This thesis first explains the background and significance of the work, then discusses the principles and theory underlying decision tree classification and rough sets. We introduce rough set theory into both the preparation of the training set and the construction of the decision tree model, focusing on reducing the size of large data sets and improving the attribute selection measure used at each node. The main contributions are as follows:

1. Existing data set compression algorithms are overly complex and do not take the reduction of instance counts seriously. We propose a space partition algorithm for large data sets based on attribute purity, which borrows the notion of clustering and uses entropy as the purity measure for partitioning: the smaller the entropy, the purer the resulting subset, i.e., the greater the internal similarity (homogeneity) of the subset (see the entropy sketch after this list).

2. In general, some information may be lost after partitioning, so a major consideration is how to retain the important information. We propose RLDS (a reduction algorithm for large data sets) based on attribute-purity partitioning and representative instance extraction: it locates the central instance of each subset by Euclidean distance, finds the k nearest neighbors of that central instance, and takes these two components together as the reduction of the training set (see the second sketch after this list). Complexity and information-theoretic analysis show that the time complexity is much lower than that of classical rough set reduction, and that the algorithm quickly finds a reduction that approximates the simplest set of the original large data set.

3. We propose a novel measure, attribute classification value, for selecting the splitting attribute at each node based on rough sets, together with a decision tree construction algorithm, ACVS (attribute classification value selection), combined with the RLDS reduction algorithm for large data sets. ACVS treats object pairs with different condition attribute values but the same class as a compensative factor to expand the discernibility matrix; the measure function of attribute classification value defined on this extended discernibility matrix is used to select the attribute at each node and measures an attribute's contribution to classification more comprehensively (see the discernibility-matrix sketch after this list). RLDS serves as the core method for optimizing the training set.

4. A decision tree classification model is designed and implemented. We evaluate the algorithm's performance on several UCI data sets, summarize the experiments, analyze the remaining problems, and propose goals and directions for future research.
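The abstract does not give the partition procedure in detail, so the following is a minimal Python sketch of the core idea in contribution 1: using Shannon entropy as the attribute purity measure when splitting a data set. The function names and the data layout (records as dicts, class labels as a parallel list) are illustrative assumptions, not the thesis's implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels; lower means purer."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def partition_by_attribute(records, labels, attr):
    """Split (record, label) pairs on one attribute's values and report
    each subset together with its entropy (attribute purity)."""
    groups = {}
    for rec, lab in zip(records, labels):
        groups.setdefault(rec[attr], ([], []))
        groups[rec[attr]][0].append(rec)
        groups[rec[attr]][1].append(lab)
    return {value: (recs, entropy(labs))
            for value, (recs, labs) in groups.items()}

# Example: the subset with the smallest entropy is the most homogeneous.
records = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}]
labels = ["no", "no", "yes"]
print(partition_by_attribute(records, labels, "outlook"))
```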
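For contribution 2, the abstract states only that RLDS keeps the central instance of each subset (found by Euclidean distance) and its k nearest neighbors. A minimal sketch of that representative-extraction step follows; the function names and the brute-force distance computation are assumptions for illustration, and the real RLDS presumably also carries the class labels along.

```python
import numpy as np

def reduce_subset(X, k):
    """Keep the central instance of a subset (minimum total Euclidean
    distance to the others) together with its k nearest neighbours."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    center = int(np.argmin(dists.sum(axis=1)))
    neighbours = np.argsort(dists[center])[1:k + 1]  # skip the centre itself
    keep = np.concatenate(([center], neighbours))
    return X[keep]

def rlds_reduction(subsets, k):
    """Union of the per-subset representatives approximates a reduction
    of the original training set."""
    return np.vstack([reduce_subset(np.asarray(S, dtype=float), k)
                      for S in subsets])

# Example with two partitioned subsets of 2-D instances.
subsets = [[[0, 0], [0, 1], [1, 0], [5, 5]], [[9, 9], [9, 8], [8, 9]]]
print(rlds_reduction(subsets, k=2))
```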
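The exact ACVS measure and its compensative factor cannot be reconstructed from the abstract, so the sketch below shows only the classical rough-set discernibility matrix on which contribution 3 builds, plus a naive per-attribute score as a stand-in for the attribute classification value. Both function names and the scoring formula are hypothetical.

```python
def discernibility_matrix(objects, decisions, attrs):
    """Classical rough-set discernibility matrix: for every pair of objects
    with different decisions, record the condition attributes whose values
    differ between the two objects."""
    matrix = {}
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if decisions[i] != decisions[j]:
                matrix[(i, j)] = {a for a in attrs
                                  if objects[i][a] != objects[j][a]}
    return matrix

def attribute_value(matrix, attr):
    """A naive importance score (not the thesis's measure): credit an
    attribute for every pair it discerns, weighted inversely by how many
    attributes share the credit."""
    return sum(1.0 / len(cell) for cell in matrix.values()
               if cell and attr in cell)

objects = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 1}]
decisions = ["yes", "no", "no"]
m = discernibility_matrix(objects, decisions, ["a", "b"])
print({a: attribute_value(m, a) for a in ["a", "b"]})
```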
Keywords/Search Tags: large data set mining, rough set, decision tree, attribute purity, discernibility matrix, attribute classification value