
Application Of Learning-to-rank Method Based On Random Forest In Self-made Dataset

Posted on: 2022-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: Q H He
Full Text: PDF
GTID: 2518306737954159
Subject: IC Engineering
Abstract/Summary:
As an interdisciplinary technique grounded in machine learning and information retrieval, learning-to-rank has been widely used in Web search, document retrieval, recommendation systems, question answering systems and other fields. Learning-to-rank trains different ranking models from different learning-to-rank data sets (that is, from differently constructed features) and different learning-to-rank methods, and the trained models then make ranking predictions for newly input target lists. As an ensemble technique based on bagging, random forest has been shown to have excellent prediction performance on small and medium-sized data sets. Building on this, this thesis studies a learning-to-rank method based on random forest and applies it to self-made movie and company data sets, aiming to further improve the overall accuracy of movie and company ranking prediction. The main tasks of the thesis are:

(1) A Random Forest-based Bootstrap Self-adaptive Double-ensemble (RF-based BSD) learning-to-rank method is proposed. Since appropriately reducing the Bootstrap ratio and adopting a double-ensemble method can effectively improve model performance, a new Bootstrap adaptive function is designed and the base model of the random forest is changed from a tree model to an ensemble model. First, BSD automatically determines the sub-sampling ratio of the random forest from the number of queries, the number of query-instance pairs and the number of features of the input data set in learning-to-rank format. Then the single-ensemble idea (a boosting-based ensemble algorithm) is used to train the base rankers of the random forest, and finally the bagging idea is used to output the final double-ensemble model.

(2) Movie data sets and company data sets are produced. Website data is first obtained with a Python crawler and the relevant features are constructed. The movie features fall into six categories (21 sub-categories in total): time series, theater, distributor, genre, series and others. The company features fall into four categories (10 sub-categories in total): ranking, revenue, value and others. The raw data then goes through missing-data handling, normalization, ranking, label division and format conversion to obtain data sets in learning-to-rank format.

(3) A variety of learning-to-rank models are constructed to realize the ranking prediction of movies and companies, and experiments are carried out to confirm the effectiveness of the proposed method. First, the original random forest framework is used to train a single-ensemble model. Three kinds of boosting single-ensemble models are then used as base rankers and combined with the RF-based learning-to-rank method to train three different double-ensemble models. The four models are then compared on the self-made movie data set and company data set. The experimental results show that the proposed method can effectively obtain the optimal sub-sampling ratio of the random forest. On both evaluation metrics, mean average precision and normalized discounted cumulative gain, the double-ensemble models trained by the proposed method outperform the single-ensemble model trained by the original method. In addition, the ranking predictions of the proposed method are largely consistent with the ranking lists on the websites, and the mean average precision of the best-performing model is generally above 98%.
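The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say which variants the thesis uses, so the following are assumptions: average precision is computed on binary relevance labels, NDCG uses the common exponential gain `2^rel - 1` with a `log2(rank + 1)` discount, and all function names are hypothetical.

```python
import math


def average_precision(relevances):
    """Average precision for one ranked list of binary labels (1 = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k positions.

    Assumption: exponential gain (2^rel - 1) and log2(rank + 1) discount.
    """
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Each input list holds the true relevance labels of one query's items, ordered by the model's predicted ranking; a perfectly ordered list gives an NDCG of 1.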
Keywords/Search Tags:Learning-to-rank, Random forest, Self-made data set, Double-ensemble model, Movie and company ranking prediction