The Research On Random Forest And Its Parallelization Oriented To Unbalanced High-dimensional Data

Posted on: 2017-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Wang
Full Text: PDF
GTID: 2308330482499734
Subject: Computer application technology
Abstract/Summary:
Random forest is an ensemble data mining algorithm used mainly for classification. Its base classifier is the decision tree, and a large number of decision trees together make up the forest. Compared with a single classifier, a random forest achieves higher accuracy and lower generalization error, so it has become an important solution to classification problems and is widely applied in everyday life and industrial production. Nevertheless, when processing unbalanced high-dimensional data, the random forest algorithm suffers from problems such as low classification accuracy and large generalization error.

At present, random forest algorithms for unbalanced high-dimensional data have rarely been studied. This paper studies such an algorithm and its parallel implementation, building on an analysis of existing classification algorithms for unbalanced data and for high-dimensional data. To address the weakness in handling class imbalance, it draws on data-level balancing algorithms and proposes a balancing method for unbalanced high-dimensional data that combines under-sampling with over-sampling. To address the low classification accuracy of the traditional random forest algorithm on high-dimensional data, it improves the way the traditional algorithm generates feature subspaces and proposes a random forest algorithm aimed at high-dimensional data.

Because each decision tree is trained independently and the forest decides by voting, the algorithm has very good potential for parallelization.
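
The abstract does not say how under-sampling and over-sampling are combined, so the following is only a minimal sketch of one common data-level approach: classes larger than a target size are randomly under-sampled, and smaller classes are randomly over-sampled by duplicating rows. The function name `rebalance` and the parameter `target_per_class` are assumptions made for illustration, not names from the thesis.

```python
import random
from collections import defaultdict


def rebalance(samples, labels, target_per_class, seed=0):
    """Return a resampled data set with target_per_class rows per class.

    Hypothetical helper: large classes are randomly under-sampled,
    small classes are randomly over-sampled by duplicating rows.
    """
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    out_x, out_y = [], []
    for y, rows in by_class.items():
        if len(rows) >= target_per_class:
            # Under-sampling: draw without replacement.
            chosen = rng.sample(rows, target_per_class)
        else:
            # Over-sampling: keep every original row, then add random duplicates.
            missing = target_per_class - len(rows)
            chosen = rows + [rng.choice(rows) for _ in range(missing)]
        out_x.extend(chosen)
        out_y.extend([y] * target_per_class)
    return out_x, out_y
```

For example, with 90 rows of one class and 10 of another, `rebalance(samples, labels, 30)` would return 30 rows of each. Plain duplication is the simplest form of over-sampling; the thesis may well use a more elaborate scheme.
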
Spark is currently a very popular distributed computing platform whose in-memory iterative computation is well suited to implementing parallel algorithms. In the era of big data, centralized algorithms find it increasingly difficult to process massive data efficiently. This paper therefore implements the random forest algorithm for high-dimensional data in parallel on the Spark platform, which improves the efficiency of the algorithm.

This paper consists of four main parts. The first reviews the literature and the distributed platform technology related to this work. The second proposes a balancing method for unbalanced high-dimensional data. The third proposes a random forest algorithm for classifying high-dimensional data. The last presents experimental results that test and evaluate the performance of the proposed algorithm and of its parallel implementation.
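
The parallel structure described in the abstract (independently trained trees, each on a random feature subspace, combined by voting) can be illustrated with a small self-contained sketch. It is not the thesis's Spark implementation: it uses a local thread pool only to show that the training tasks are independent, depth-1 trees (stumps) instead of full decision trees, and misclassification count instead of a criterion such as Gini impurity. All function names and parameters here are assumptions.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(args):
    """Train one depth-1 tree on a bootstrap sample and a random feature subset."""
    X, y, n_sub, seed = args
    rng = random.Random(seed)
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]       # bootstrap rows
    feats = rng.sample(range(len(X[0])), n_sub)      # random feature subspace
    best = None
    for f in feats:
        for t in sorted({X[i][f] for i in idx}):
            left = [y[i] for i in idx if X[i][f] <= t]
            right = [y[i] for i in idx if X[i][f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # No valid split exists: always predict the sample's majority label.
        label = Counter(y[i] for i in idx).most_common(1)[0][0]
        return (0, float("inf"), label, label)
    return best[1:]


def predict_one(stump, x):
    f, t, l_lab, r_lab = stump
    return l_lab if x[f] <= t else r_lab


def train_forest(X, y, n_trees=25, n_sub=1, workers=4):
    """Train the trees as independent tasks; no task depends on another."""
    jobs = [(X, y, n_sub, seed) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_stump, jobs))


def forest_predict(forest, x):
    """Majority vote over the predictions of all trees."""
    votes = Counter(predict_one(s, x) for s in forest)
    return votes.most_common(1)[0][0]
```

Because each call to `train_stump` needs only the data and its own seed, the same tasks could in principle be distributed across the workers of a cluster, which is the property the abstract relies on when moving the algorithm to Spark.
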
Keywords/Search Tags:random forest, high-dimensional, unbalanced, characteristic sub-space, Spark platform