The Research And Application Of Classification Algorithm’s Parallelization

Posted on: 2015-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: H T Wang
GTID: 2308330473452720
Subject: Computer application technology
Abstract/Summary:
With the advent of the "information explosion" era, data mining is being applied ever more widely. Many business decision makers use data mining techniques to extract useful information from vast datasets and to inform future decisions. However, traditional data mining algorithms are inefficient on massive data for a variety of reasons and cannot meet growing demand, so more efficient algorithms or implementation strategies are needed. In banking and related financial industries, as credit cards and lending become more widespread, service providers need to understand their customers' creditworthiness and reduce credit risk in order to run their business well. To address these problems, this thesis analyzes classification algorithms from the perspectives of data storage and algorithm design, points out their deficiencies in data mining, and selects the Sprint decision tree classification algorithm as the specific object to study, improve, and optimize. Finally, the improved algorithm is applied to a bank customer credit evaluation system to mine classification rules. The work therefore has some theoretical and practical significance.

The main work of this thesis:
(1) Analyzed the basic theory and principles of several typical classification algorithms (decision trees, neural networks, Bayesian networks, and genetic algorithms), and presented their basic parallelization strategies by summarizing the results of previous studies;
(2) Analyzed three features of current data mining applications: data is mainly stored in traditional relational databases, massive datasets need to be processed, and data mining operations are mainly column-oriented. From these features we conclude that traditional row-oriented storage and serial algorithms cannot meet the demand for efficiency, and that more efficient storage methods and implementation strategies are needed to replace them;
(3) Among the many decision tree algorithms, chose the Sprint algorithm as the specific object of study, and pointed out its defects in current parallel data mining practice, including the drawbacks of its data storage method and the limitations of the algorithm itself. After analyzing row-oriented storage and cloud storage, we chose column storage for the training dataset and the attribute lists. Meanwhile, to reduce I/O operations, this thesis improves how the split attribute and the non-split attributes are partitioned, and gives some parallelization strategies for the improved Sprint algorithm;
(4) Finally, used the Java RMI (Java Remote Method Invocation) mechanism to implement the improved Sprint algorithm and applied it in the classification data mining module of the bank customer credit evaluation system. By analyzing and comparing the performance of column-oriented and row-oriented databases, we conclude that storing the training dataset and attribute lists in a column database improves storage space utilization and query efficiency. In addition, comparing the original Sprint algorithm with the improved one shows that the improved algorithm reduces the I/O cost of disk access and greatly reduces execution time, especially on massive datasets. Hence, the improved algorithm is effective.
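
As a rough illustration of why Sprint-style classification suits column storage, the sketch below builds one presorted attribute list per column and scans each list once to find its best Gini split. It is a minimal sketch and not the thesis's implementation: the column names, the sample data, and the helper names (`build_attribute_list`, `best_split`, `choose_split`) are hypothetical, and a local thread pool stands in for the Java RMI workers mentioned above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def gini(counts, total):
    """Gini impurity of a class histogram holding `total` records."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def build_attribute_list(column, labels):
    """Build (value, class label, record id) entries from ONE column, sorted by value."""
    entries = [(v, y, rid) for rid, (v, y) in enumerate(zip(column, labels))]
    return sorted(entries, key=lambda e: e[0])


def best_split(attr_list):
    """Scan a presorted attribute list once; return (weighted gini, threshold)."""
    n = len(attr_list)
    below = Counter()                           # class histogram left of the cut
    above = Counter(y for _, y, _ in attr_list)  # class histogram right of the cut
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label, _ = attr_list[i]
        below[label] += 1
        above[label] -= 1
        next_value = attr_list[i + 1][0]
        if value == next_value:
            continue  # cannot cut between two equal values
        n_below, n_above = i + 1, n - (i + 1)
        score = (n_below / n) * gini(below, n_below) + (n_above / n) * gini(above, n_above)
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best


def choose_split(columns, labels):
    """Evaluate every attribute list independently and keep the lowest-gini split."""
    names = list(columns)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda name: best_split(build_attribute_list(columns[name], labels)), names))
    (score, threshold), name = min(zip(results, names), key=lambda pair: pair[0][0])
    return name, threshold, score


# Hypothetical credit data: two numeric columns and a class label per customer.
columns = {
    "income": [20, 35, 50, 65, 80, 95],
    "age": [25, 40, 31, 52, 38, 47],
}
labels = ["bad", "bad", "bad", "good", "good", "good"]
print(choose_split(columns, labels))
```

Each call to `build_attribute_list` reads a single column, so a column store can serve it without touching the other attributes, and because the lists are evaluated independently the work can be spread across workers.
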
Keywords/Search Tags: Classification algorithms, decision trees, Sprint, parallelism, column-store