The Research On Random Forest And Its Parallelization Oriented To Unbalanced High-dimensional Data

Posted on:2017-01-15

Degree:Master

Type:Thesis

Country:China

Candidate:X Wang

Full Text:PDF

GTID:2308330482499734

Subject:Computer application technology

Abstract/Summary:

Random forest is an ensemble data mining algorithm used mainly for classification. Its base classifiers are decision trees, and a large number of decision trees together make up the forest. Compared with a single classifier, a random forest achieves higher accuracy and lower generalization error, so it has become an important approach to classification problems and is widely applied in everyday life and industrial production. When processing unbalanced high-dimensional data, however, the random forest algorithm suffers from problems such as low classification accuracy and large generalization error.

Random forest algorithms for unbalanced high-dimensional data have so far received little study. Building on an analysis of classification algorithms for unbalanced data and for unbalanced high-dimensional data, this thesis studies a random forest algorithm for unbalanced high-dimensional data and its parallel implementation. To address the class imbalance, it draws on data-level balancing methods and proposes a balancing method for unbalanced high-dimensional data that combines under-sampling with over-sampling. To address the low classification accuracy of the traditional random forest algorithm on high-dimensional data, it improves the way the traditional algorithm generates feature subspaces and proposes a random forest algorithm aimed at high-dimensional data.

The decision trees of a random forest are trained independently of one another and their predictions are combined by voting, which gives the algorithm very good potential for parallelization.
Spark is currently a very popular distributed computing platform, and its in-memory iterative computation is well suited to implementing parallel algorithms. In the era of big data, centralized algorithms find it increasingly difficult to process massive data efficiently. This thesis therefore implements the random forest algorithm for high-dimensional data in parallel on the Spark platform, which improves the efficiency of the algorithm.

The thesis consists of four main parts. The first surveys the literature and the distributed-platform technology related to this work. The second proposes a balancing method for unbalanced high-dimensional data. The third proposes a random forest algorithm for classifying high-dimensional data. The fourth presents experimental results that test and evaluate the performance of the proposed algorithm and its parallel implementation.
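
The abstract does not say how under-sampling and over-sampling are combined, so the following is only a minimal sketch of one common data-level scheme, not the thesis's method. It assumes plain random resampling toward a shared per-class target, taken here to be the mean class size. The function name `balance_by_resampling` and its parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_resampling(rows, labels, target=None, seed=0):
    """Randomly resample every class to `target` examples.

    Assumption (not from the abstract): `target` defaults to the mean
    class size, so larger classes are under-sampled without replacement
    and smaller classes are over-sampled with replacement.
    """
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    if target is None:
        target = len(rows) // len(by_class)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target:
            # Under-sample: keep a random subset of the larger class.
            chosen = rng.sample(members, target)
        else:
            # Over-sample: keep all originals, then duplicate random ones.
            extra = rng.choices(members, k=target - len(members))
            chosen = members + extra
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data: 90 majority examples (label 0) and 10 minority examples (label 1).
rows = [[i, i % 7] for i in range(100)]
labels = [0] * 90 + [1] * 10
balanced_rows, balanced_labels = balance_by_resampling(rows, labels)
print(Counter(balanced_labels))
```

Meeting in the middle discards fewer majority examples than pure under-sampling and duplicates fewer minority examples than pure over-sampling. A real pipeline might replace the random duplication with a synthetic over-sampling method.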

Keywords/Search Tags:

random forest, high-dimensional, unbalanced, characteristic sub-space, Spark platform

Related items

1	Research On Random Forest Classification Algorithm Based On Spark Distributed Platform
2	Research And Application Of High Dimensional Imbalanced Data Classification Based On Random Forest
3	Research On Parallelization And Optimization Of Random Forest Classification Algorithm Based On Spark
4	Research On User Loan Risk Prediction Based On Random Forest Algorithm Based On Spark Platform
5	Classification Of Encrypted Traffic Application Service Based On Spark Platform
6	Evaluation Of Confounder-controlled Random Forest And Its Application In High Dimensional Data Analysis
7	Research On Optimization Of Random Forest Algorithm And Its Application In Text Parallel Classification
8	Performance Prediction And Optimization For Apache Spark Platform
9	The Application Of Ensemble Classification On Unbalanced Data In Bank Marketing
10	Research On Parallel Text Classification Algorithm Base On Random Forest And Spark