
Optimization And Parallelization Of The GBRT Algorithm For Learning To Rank

Posted on: 2016-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: L Jin
Full Text: PDF
GTID: 2348330461460073
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the amount of document data on the network is growing exponentially. Along with this explosive growth of network information, people's demand for more efficient access to information is growing as well, and search engines are the applications that assume the central role in meeting it. Learning to Rank (LTR), one of the most important techniques behind industrial applications such as search engines, is now attracting wide attention, and researchers have made great progress in devising a variety of algorithms. However, as data volumes have continued to increase in recent years, more and more researchers have begun to pay attention to the efficiency of LTR training. GBRT (Gradient Boosting Regression Tree), one of the most important algorithms in the field, also faces the challenges of the big-data era: its training suffers from the sequential nature of the Gradient Boosting framework and the low parallel granularity of single-tree training, and these problems become even more prominent as the scale of the training data grows. In 2013, Baidu even listed "Research in Parallelizing the Training of GBRT" as one of its most valuable research topics and called for solutions, and our research group applied for it successfully.

Against this background, and based on the project we applied for, we carried out research on accelerating GBRT training while maintaining accuracy. The work consists of two parts: algorithm optimization and parallelization.

First, in algorithm optimization, we propose a novel GBRT algorithm based on K-Means histograms, called KH-GBRT. Because the Gradient Boosting framework is sequential in nature and difficult to parallelize, we focus primarily on speeding up the training of a single regression tree. We first propose an approximation algorithm based on K-Means histograms and combine it with the regression tree to reduce the time complexity of finding the best split when splitting a tree node; this method also increases the parallel granularity of the original regression-tree algorithm. We then combine kernel density estimation with the K-Means histogram to compensate for the accuracy loss introduced by the approximation, obtaining better accuracy.

Second, for parallelization, we propose two parallel schemes, based on MPI and on the newly introduced big-data processing framework Spark, according to their respective characteristics. MPI offers more flexible interfaces and better running efficiency, while Spark features better fault tolerance, programmability, and code portability; users can choose either according to their needs.

We also conducted a series of experiments on both public benchmark datasets and Baidu's internal datasets to evaluate our work. The results show that the MPI-based KH-GBRT achieves an average speedup of 1.49-1.54 under different configurations compared with the most efficient large-scale distributed GBRT algorithm to date, along with slightly better accuracy and near-linear scalability. A comparison between Baidu's own GBRT and KH-GBRT shows that KH-GBRT achieves better accuracy and an average speedup of 1.6-2.13. Although the Spark-based KH-GBRT is slightly slower than its MPI counterpart, it exhibits better fault tolerance and programmability, and shows quasi-linear, and under certain circumstances super-linear, scalability.
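The histogram-based split search summarized above can be sketched roughly as follows. This is a minimal illustration, not the thesis's exact formulation: the function names, the choice of k, the 1-D k-means binning, and the variance-reduction gain criterion are all assumptions made for the sketch. The key idea it demonstrates is that scanning k bin boundaries replaces scanning all n sorted feature values.

```python
import numpy as np

def kmeans_bins(values, k, iters=10):
    """Place k histogram bin centers over one feature column with 1-D k-means,
    initialized at quantiles so the bins adapt to the data distribution."""
    centers = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            pts = values[assign == j]
            if len(pts):
                centers[j] = pts.mean()
    return centers, assign

def best_split_from_histogram(values, targets, k=16):
    """Approximate the best variance-reducing split by scanning only the
    k - 1 bin boundaries instead of all n sorted values."""
    centers, assign = kmeans_bins(values, k)
    order = np.argsort(centers)
    cnt = np.array([np.sum(assign == j) for j in order], dtype=float)
    tsum = np.array([targets[assign == j].sum() for j in order])
    total_cnt, total_sum = cnt.sum(), tsum.sum()
    best_gain, best_thr = -np.inf, None
    left_cnt = left_sum = 0.0
    for i in range(k - 1):
        left_cnt += cnt[i]
        left_sum += tsum[i]
        right_cnt = total_cnt - left_cnt
        if left_cnt == 0 or right_cnt == 0:
            continue
        # Variance reduction up to a constant: S_L^2/n_L + S_R^2/n_R.
        gain = left_sum**2 / left_cnt + (total_sum - left_sum)**2 / right_cnt
        if gain > best_gain:
            best_gain = gain
            best_thr = (centers[order[i]] + centers[order[i + 1]]) / 2
    return best_thr, best_gain
```

Because only per-bin counts and target sums enter the scan, the work per candidate split is constant, which is what lowers the complexity of splitting a node.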
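Kernel density estimation, the accuracy-compensation ingredient mentioned above, is a standard smoothing technique. A minimal Gaussian-kernel version is shown below; the fixed hand-chosen bandwidth and the standalone form are illustrative assumptions, and the thesis's specific combination of KDE with the K-Means histogram is not reproduced here.

```python
import numpy as np

def gaussian_kde(samples, query, bandwidth):
    """Estimate the density at each query point as the average of
    Gaussian kernels centred on the samples."""
    z = (query[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)
```

The intuition for using such an estimate alongside a coarse histogram is that the smoothed density retains distributional detail that the per-bin summaries discard.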
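Both parallel schemes rest on the same property: per-bin histogram statistics merge by simple addition, so each worker can build a local histogram on its own data partition and the partial results can be combined with an MPI Allreduce (with MPI_SUM) or a Spark treeAggregate. A single-process sketch of that merge step follows; the function name is illustrative and the sketch stands in for the actual distributed reduce.

```python
import numpy as np

def merge_histograms(local_hists):
    """Reduce step: per-partition bin counts (or per-bin gradient sums)
    combine by elementwise addition -- the operation an MPI Allreduce
    with MPI_SUM, or Spark's treeAggregate, performs across workers."""
    return np.sum(np.stack(local_hists), axis=0)
```

Because the merged histogram is identical to the one a single machine would build on the full data, the split search itself needs no change under parallelization.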
Keywords/Search Tags:Learning to Rank, Gradient Boosting Regression Tree, K-Means Histogram, Kernel Density Estimation, MPI, Spark