
Research Of Calibrated Label Ranking Multi-label Algorithm Based On Spark

Posted on: 2019-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Zhang
Full Text: PDF
GTID: 2428330590465762
Subject: Computer Science and Technology
Abstract/Summary:
With the expansion of data scale and the explosive growth of data volume, traditional labeling techniques no longer satisfy people's needs. Data from different fields show different characteristics, including variety in data types, low value density, and timeliness requirements for processing. In traditional single-label data mining, each sample belongs to exactly one category. However, things are often latently connected, and these associations divide an original category into several categories. With the development of multi-label technology, many multi-label machine learning algorithms have appeared whose performance varies across situations, so selecting an appropriate algorithm for each application scenario is one way to improve prediction accuracy. Across first-order, second-order, and higher-order multi-label learning strategies, the use of label correlation differs greatly in performance, and the complexity changes dramatically. To explore the correlation between labels, this paper selects a second-order multi-label learning method, namely the Calibrated Label Ranking (CLR) algorithm. The traditional calibrated label ranking algorithm uses pairwise label associations to transform the problem and predict results. Its calibration is achieved by comparison against binary relevance (BR). Its predictions therefore depend to some extent on the results of BR, which limits prediction on some datasets. When the number of features and data samples keeps increasing, running the serial method directly may take too long to deliver results in time. Spark parallelization can effectively reduce the computation time. Therefore, this paper proposes a calibrated label ranking method based on Spark parallel computing. The main contents are as follows:

1. To better distinguish relevant from irrelevant labels, a method for calibrating label boundary regions is presented, which further corrects the boundary between relevant and irrelevant labels using Bayesian probability, thereby improving classification accuracy in the boundary region. The proposed CLR method based on Naive Bayes (NBCLRM) is compared with seven traditional methods, including calibrated label ranking. Experimental results show that the proposed algorithm can not only adjust prediction results by modifying its two thresholds, but also effectively improve on the performance of traditional multi-label learning methods.

2. Apache Spark is a popular open-source platform for large-scale data processing that is well suited to iterative machine learning tasks. NBCLRM is combined with the Spark distributed parallel computing framework; taking full advantage of Spark's parallel computing effectively addresses the long running time and low execution efficiency of the algorithm.
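
As a rough illustration of the voting step in standard calibrated label ranking (not the thesis's NBCLRM variant, whose details are not given here), the sketch below counts pairwise votes for a single instance and uses a virtual calibration label to split relevant from irrelevant labels. The function name `clr_predict`, its argument layout, and the tie rule (a label tied with the virtual label is treated as irrelevant) are assumptions made for illustration; the pairwise and BR decisions are taken as given inputs rather than produced by trained classifiers.

```python
from itertools import combinations


def clr_predict(n_labels, pairwise, br):
    """Hypothetical helper: CLR voting for one instance.

    pairwise[(i, j)] with i < j is True if label i is preferred over label j.
    br[i] is True if label i is preferred over the virtual calibration label
    (this is the binary relevance decision for label i).
    """
    votes = [0] * n_labels
    virtual_votes = 0

    # One vote per label pair, awarded to the preferred label.
    for i, j in combinations(range(n_labels), 2):
        if pairwise[(i, j)]:
            votes[i] += 1
        else:
            votes[j] += 1

    # Each BR decision is a comparison between a real label and the virtual label.
    for i in range(n_labels):
        if br[i]:
            votes[i] += 1
        else:
            virtual_votes += 1

    # Assumption: labels ranked strictly above the virtual label are relevant.
    relevant = [i for i in range(n_labels) if votes[i] > virtual_votes]
    return votes, virtual_votes, relevant


# Three labels: label 0 beats 1 and 2, label 1 beats 2; BR accepts labels 0 and 1.
example = clr_predict(
    3,
    {(0, 1): True, (0, 2): True, (1, 2): True},
    [True, True, False],
)
# example == ([3, 2, 0], 1, [0, 1])
```

Labels whose vote counts sit close to the virtual label's count are the cases where the relevant/irrelevant decision is least certain, which is presumably the boundary region the abstract refers to.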
Keywords/Search Tags: machine learning, multi-label, calibrated label, label correlations, parallel, Naive Bayes, Spark