
Optimization And Parallelization Of The GBRT Algorithm For Learning To Rank

Posted on: 2016-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: L Jin
Full Text: PDF
GTID: 2348330461460073
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the amount of document data on the network is growing exponentially. Along with this explosive growth of network information, people's demand for more efficient access to information is growing as well, and search engines are the applications that assume the central role in meeting it. Learning to Rank (LTR), one of the most important techniques behind industrial applications such as search engines, is now attracting wide attention, and researchers have made great progress in devising a variety of algorithms. However, as data volumes have continued to increase in recent years, more and more researchers have begun to pay attention to the efficiency of LTR training. GBRT (Gradient Boosting Regression Tree), one of the most important algorithms in the field, also faces the challenges of the big-data era: its training suffers from the sequential nature of the Gradient Boosting framework and the low parallel granularity of single-tree training, and these problems become even more prominent as the scale of the training data grows. In 2013, Baidu even listed "Research in Parallelizing the Training of GBRT" as one of its most valuable research topics and called for solutions, and our research group applied for it successfully.

Against this background, and based on the project we applied for, we carried out research on accelerating GBRT training while maintaining accuracy. The work consists of two parts: algorithm optimization and parallelization.

First, in algorithm optimization, we propose a novel GBRT algorithm based on K-Means histograms, called KH-GBRT. Because the Gradient Boosting framework is sequential in nature and difficult to parallelize, we focus primarily on speeding up the training of a single regression tree. We first propose an approximation algorithm based on K-Means histograms and combine it with the regression tree to reduce the time complexity of finding the best split when splitting a tree node; this method also increases the parallel granularity of the original regression-tree algorithm. We then combine kernel density estimation with the K-Means histogram to compensate for the accuracy loss introduced by the approximation, obtaining better accuracy.

Second, for parallelization, we propose two parallel schemes, based on MPI and on the newly introduced big-data processing framework Spark, according to their respective characteristics. MPI offers more flexible interfaces and better running efficiency, while Spark features better fault tolerance, programmability, and code portability; users can choose either according to their needs.

We also conducted a series of experiments on both public benchmark datasets and Baidu's internal datasets to evaluate our work. The results show that the MPI-based KH-GBRT achieves an average speedup of 1.49-1.54 under different configurations compared with the most efficient large-scale distributed GBRT algorithm to date, along with slightly better accuracy and near-linear scalability. A comparison between Baidu's own GBRT and KH-GBRT shows that KH-GBRT achieves better accuracy and an average speedup of 1.6-2.13. Although the Spark-based KH-GBRT is slightly slower than its MPI counterpart, it exhibits better fault tolerance and programmability, and shows quasi-linear, and under certain circumstances super-linear, scalability.
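The histogram-based split search summarized above can be sketched roughly as follows. This is a minimal illustration, not the thesis's exact formulation: the function names, the choice of k, the 1-D k-means binning, and the variance-reduction gain criterion are all assumptions made for the sketch. The key idea it demonstrates is that scanning k bin boundaries replaces scanning all n sorted feature values.

```python
import numpy as np

def kmeans_bins(values, k, iters=10):
    """Place k histogram bin centers over one feature column with 1-D k-means,
    initialized at quantiles so the bins adapt to the data distribution."""
    centers = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            pts = values[assign == j]
            if len(pts):
                centers[j] = pts.mean()
    return centers, assign

def best_split_from_histogram(values, targets, k=16):
    """Approximate the best variance-reducing split by scanning only the
    k - 1 bin boundaries instead of all n sorted values."""
    centers, assign = kmeans_bins(values, k)
    order = np.argsort(centers)
    cnt = np.array([np.sum(assign == j) for j in order], dtype=float)
    tsum = np.array([targets[assign == j].sum() for j in order])
    total_cnt, total_sum = cnt.sum(), tsum.sum()
    best_gain, best_thr = -np.inf, None
    left_cnt = left_sum = 0.0
    for i in range(k - 1):
        left_cnt += cnt[i]
        left_sum += tsum[i]
        right_cnt = total_cnt - left_cnt
        if left_cnt == 0 or right_cnt == 0:
            continue
        # Variance reduction up to a constant: S_L^2/n_L + S_R^2/n_R.
        gain = left_sum**2 / left_cnt + (total_sum - left_sum)**2 / right_cnt
        if gain > best_gain:
            best_gain = gain
            best_thr = (centers[order[i]] + centers[order[i + 1]]) / 2
    return best_thr, best_gain
```

Because only per-bin counts and target sums enter the scan, the work per candidate split is constant, which is what lowers the complexity of splitting a node.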
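Kernel density estimation, the accuracy-compensation ingredient mentioned above, is a standard smoothing technique. A minimal Gaussian-kernel version is shown below; the fixed hand-chosen bandwidth and the standalone form are illustrative assumptions, and the thesis's specific combination of KDE with the K-Means histogram is not reproduced here.

```python
import numpy as np

def gaussian_kde(samples, query, bandwidth):
    """Estimate the density at each query point as the average of
    Gaussian kernels centred on the samples."""
    z = (query[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)
```

The intuition for using such an estimate alongside a coarse histogram is that the smoothed density retains distributional detail that the per-bin summaries discard.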
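Both parallel schemes rest on the same property: per-bin histogram statistics merge by simple addition, so each worker can build a local histogram on its own data partition and the partial results can be combined with an MPI Allreduce (with MPI_SUM) or a Spark treeAggregate. A single-process sketch of that merge step follows; the function name is illustrative and the sketch stands in for the actual distributed reduce.

```python
import numpy as np

def merge_histograms(local_hists):
    """Reduce step: per-partition bin counts (or per-bin gradient sums)
    combine by elementwise addition -- the operation an MPI Allreduce
    with MPI_SUM, or Spark's treeAggregate, performs across workers."""
    return np.sum(np.stack(local_hists), axis=0)
```

Because the merged histogram is identical to the one a single machine would build on the full data, the split search itself needs no change under parallelization.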
Keywords/Search Tags:Learning to Rank, Gradient Boosting Regression Tree, K-Means Histogram, Kernel Density Estimation, MPI, Spark