
Research Of Data Classification Algorithms In Data-intensive Computing Environments

Posted on: 2014-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: Q Z Deng
Full Text: PDF
GTID: 2248330398498029
Subject: Computer application technology
Abstract/Summary:
Big data is currently a very popular concept. Over time, the amount of data generated by enterprises has grown increasingly large; it includes customer purchasing preferences and trends, web access habits, customer review data, and so on. How to extract useful information from this data is the most important issue for researchers and users, so data mining in data-intensive computing environments has become a research focus.

This thesis describes the characteristics and typical applications of data-intensive computing. It summarizes the state of data mining research in data-intensive computing environments, the traditional classification methods used in data mining, and the typical decision tree classification algorithms. It also introduces the parallel mining strategy, the Hadoop distributed system architecture, and the SPRINT algorithm. SPRINT is a traditional decision tree classification algorithm with good scalability and efficiency, parallel mining is the main research direction in data mining, and Hadoop is regarded here as the best choice for data processing. Data mining research for data-intensive computing environments focuses mainly on how to perform data mining and data management efficiently by exploiting the scalability and fault tolerance of large-scale cluster systems.

This thesis proposes a new decision tree classification algorithm, MR-DIDC, to address data mining in data-intensive computing environments, and illustrates its main idea in detail with an example. MR-DIDC is an improved algorithm based on SPRINT and the MapReduce programming framework. During decision tree node expansion, it uses the parallel computing capacity of MapReduce to compute the best splitting attribute and its split point and to partition the attribute lists, which improves the efficiency of the algorithm.
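For reference, the published SPRINT algorithm ranks candidate splits with the gini index. The abstract does not state which criterion MR-DIDC uses, so the formulas below assume it keeps SPRINT's. The symbols are introduced here for illustration: S is the set of n records at a node, p_j is the relative frequency of class j in S, and S_1, S_2 (with n_1, n_2 records) are the two partitions produced by a candidate split.

```latex
\[
  \mathrm{gini}(S) = 1 - \sum_{j} p_j^{2}
\]
\[
  \mathrm{gini}_{\mathrm{split}}(S)
    = \frac{n_1}{n}\,\mathrm{gini}(S_1) + \frac{n_2}{n}\,\mathrm{gini}(S_2)
\]
```

The attribute and split point with the lowest gini_split value are chosen to expand the node.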
MR-DIDC modifies SPRINT and introduces several new data structures to support parallel operation. The data structures are the attribute list, histogram, block count matrix, block hash table, and block histogram. The attribute lists serve the same function as in SPRINT; the initial lists for continuous attributes are sorted by attribute value once, when they are first created. Continuous attributes maintain histogram information: each data block has two histograms, denoted C_above and C_below, which capture the class distribution of the attribute records at a given node. Categorical attributes maintain block count matrix information, which contains the class distribution for each value of the given attribute in each data block. The block histogram is a new data structure introduced by the algorithm; it captures the global class distribution at each data block. Using the block histogram simplifies the calculation of the split point: the data nodes do not need to communicate with one another during the calculation, which reduces the number of I/O operations and increases the data availability of the algorithm. The experimental results show that MR-DIDC has good scalability and high data availability for big data when running on large clusters.
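The following is a minimal Python sketch of how per-block histograms could let each data block evaluate split points on a continuous attribute without talking to other blocks. It is an interpretation of the description above, not the thesis's implementation. It assumes that the sorted attribute list is cut into consecutive non-empty blocks, that a block histogram is the class count of one block, and that splits are ranked by SPRINT's gini index. All function names are hypothetical.

```python
from collections import Counter

def gini(hist, n):
    """Gini index of a class histogram holding n records."""
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in hist.values())

def split_gini(below, above):
    """Weighted gini index of a binary split given its two histograms."""
    n_below, n_above = sum(below.values()), sum(above.values())
    n = n_below + n_above
    return (n_below / n) * gini(below, n_below) + (n_above / n) * gini(above, n_above)

def block_histograms(blocks):
    """Assumed meaning of 'block histogram': class counts of each block."""
    return [Counter(label for _, label in block) for block in blocks]

def best_split_in_block(k, blocks, hists):
    """Evaluate the candidate split points that fall inside block k.

    C_below starts as the sum of the histograms of all earlier blocks and
    C_above as the remainder, so no records of other blocks are read.
    """
    total = sum(hists, Counter())
    below = sum(hists[:k], Counter())   # C_below at the start of block k
    above = total - below               # C_above at the start of block k
    # Assumption: the last value of the previous block is shipped along
    # with the histograms so the boundary candidate can be evaluated.
    prev = blocks[k - 1][-1][0] if k > 0 else None
    best = (float("inf"), None)
    for value, label in blocks[k]:
        if prev is not None and value != prev:
            score = split_gini(below, above)
            if score < best[0]:
                best = (score, (prev + value) / 2)  # midpoint candidate
        below[label] += 1
        above[label] -= 1
        prev = value
    return best

def best_split(blocks):
    """Reduce step: keep the lowest-gini candidate over all blocks."""
    hists = block_histograms(blocks)
    results = (best_split_in_block(k, blocks, hists) for k in range(len(blocks)))
    return min(results, key=lambda result: result[0])

# Toy attribute list (value, class label), sorted by value, in three blocks.
blocks = [
    [(17, "High"), (20, "High")],
    [(23, "High"), (32, "Low")],
    [(43, "High"), (68, "Low")],
]
score, split_point = best_split(blocks)
print(split_point, round(score, 4))
```

On this toy list the sketch selects the split point 27.5, the same result a single sequential scan over the unblocked list gives. The per-block loops are independent of one another, so in a MapReduce setting they could run as separate map tasks with the final `min` as the reduce step.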
Keywords/Search Tags: Big Data, Data-intensive, Data Classification, MapReduce, SPRINT