Research On Improved Random Forests Algorithm Based On The Balance Maximization And Consensus Maximization

Posted on: 2017-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhu
Full Text: PDF
GTID: 2308330482992235
Subject: Computer software and theory
Abstract/Summary:
Over the past few decades, improvements in computer performance and falling hardware and software costs have made computers increasingly powerful. Widespread data collection and the ready availability of storage devices have driven the rapid development of the database and information industries. As society has become more information-driven, data volumes have grown sharply. However, the valuable knowledge hidden in these data is still not well understood. Data mining is one approach to this problem. In particular, when a model learns which category each data item belongs to under external guidance, such as labels supplied in advance, the process is called supervised learning. Random forests is one such supervised learning method.

Random forests is a model combination algorithm that originates from the ensemble learning methods of machine learning. It builds multiple classifiers from the training set and then combines their results into the final classification, improving the accuracy of the combined classifier. The random forests algorithm performs well in many areas, such as pattern recognition, text classification, and commodity recommendation. Domestic researchers have focused mainly on applying random forests in specific fields, and there has been less research on improving the algorithm itself. In particular, in the context of big data, the performance and accuracy of random forests have not been studied sufficiently.

This thesis focuses on using the random forests algorithm for classification problems on big data, and seeks to improve performance in two respects. The first is data preprocessing, which addresses the problem of classifying unbalanced data with random forests. The second is model combination, which enhances the random forests algorithm itself.

First, the thesis analyzes how unbalanced data sets affect classification algorithms and summarizes the common methods for balancing data along with their shortcomings. It then proposes a novel algorithm, an adaptive random sampling algorithm based on balance maximization. Experiments show that the proposed algorithm performs well on unbalanced data.

Second, the thesis further improves the original random forests algorithm by replacing majority voting with a consensus maximization strategy, proposing a second new algorithm: a model combination algorithm based on consensus maximization. By taking into account both the empirical error and the generalization error of the model combination, the new algorithm lets each individual classifier play to its strengths, reinforcing the advantages of good classifiers and reducing the influence of poor ones. Experiments show that it further improves the classification performance of the model combination, achieving high accuracy and strong generalization ability.
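To make the contrast with majority voting concrete, the following is a minimal Python sketch. It is not the thesis's consensus maximization algorithm, which the abstract does not specify. It only illustrates the general idea that weighting each classifier's vote can change the combined decision. The function names, labels, and weights are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight.

    The weights are hypothetical reliability scores, one per classifier;
    how to choose them is exactly what a real combination strategy must decide.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Three hypothetical trees classify one sample; the third is assumed
# to be far more reliable than the other two.
predictions = ["spam", "spam", "ham"]
weights = [0.2, 0.2, 0.9]

print(majority_vote(predictions))           # two of three trees say "spam"
print(weighted_vote(predictions, weights))  # "ham": 0.9 outweighs 0.2 + 0.2
```

With equal weights the two functions agree. They differ only when a minority of the classifiers carries more total weight than the majority, which is the situation the abstract describes as strengthening good classifiers and weakening poor ones.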
Keywords/Search Tags:Random Forests, Model Combination, Balance Maximization, Consensus Maximization, Majority Voting, Generalization Error