
Implementing And Optimizing The Alternating Least Squares Algorithm On Many-cores

Posted on: 2019-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2370330623450959
Subject: Computer Science and Technology
Abstract/Summary:
The task of a recommender system is to connect users with items and help massive numbers of users find valuable information. An excellent recommendation algorithm can greatly improve the efficiency of a recommender system, enhance the accuracy of its results, and meet the real-time needs of massive numbers of Internet users. At the same time, heterogeneous architectures have been developing rapidly, enabling the combination of high-performance computing technology and recommender systems. In this work, we aim to implement and optimize a typical recommender-system algorithm, alternating least squares (ALS), on modern multi-cores and many-cores. In brief, we make the following contributions:

1. We implement ALS in OpenCL to make the algorithm portable across platforms. To do so, we apply platform-specific optimization techniques according to each platform's features, generating multiple code variants from which researchers can select the one suited to their target platform. The experimental results show that our optimized ALS implementation runs 5.5× and 21.2× faster than the baseline implementation on an Intel E5-2670 and a K20c, respectively. Meanwhile, our implementation outperforms cuMF on the K20c over various datasets.

2. Based on an in-depth analysis of the algorithm's hotspot and of the existing ALS implementations, we propose an efficient implementation using a fine-grained parallel optimization technique. We implement a fine-grained task partition and use a more suitable thread configuration. The experimental results show that our implementation runs up to 88× faster on the K20c and up to 98× faster on gfx803 than the baseline implementation. On the K20c, our implementation also outperforms cuMF for latent feature sizes ranging from 10 to 100 over various recommendation datasets.

3. Based on our statistics over recommendation datasets and our analysis of the previous implementations, we propose an efficient ALS implementation on top of a new blocked storage format for sparse matrices, targeting parallel matrix factorization. Specifically, we develop a new compressed storage format for sparse matrices together with a data-reuse technique; we observe that the repeated loads of row/column vectors can be avoided by this data reuse. Further, we propose a data-reordering technique that enhances the performance benefits of data reuse. Our experimental results show that this implementation attains speedups of 2.08× and 3.72× over the state-of-the-art implementation by Gates on the K20c and a TITAN X, respectively. To the best of our knowledge, this is the fastest ALS implementation for matrix factorization on modern many-core GPUs.

In a nutshell, we develop an efficient ALS implementation by comprehensively considering the features of the ALS algorithm, the specifics of modern many-core architectures, and the inherent characteristics of recommendation datasets. By doing so, we achieve a more balanced usage of both on-chip and off-chip hardware resources. We believe that our implementation can be integrated into modern big-data processing frameworks such as Spark in a straightforward way.
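To make the algorithmic core concrete: ALS for matrix factorization alternates between fixing the item factors and solving a small regularized normal-equation system for each user, and vice versa. This per-row solve is the hotspot that the GPU implementations above parallelize. The following is a minimal dense NumPy sketch of standard ALS, not the thesis's optimized kernels; the function name, the regularization value, and the initialization are illustrative assumptions.

```python
import numpy as np

def als(R, k=10, lam=0.1, iters=10):
    """Alternating least squares for matrix factorization.

    R: (m, n) rating matrix; zeros mark missing entries.
    Returns factors X (m, k) and Y (n, k) with R approx X @ Y.T.
    """
    m, n = R.shape
    rng = np.random.default_rng(0)
    X = rng.standard_normal((m, k)) * 0.1
    Y = rng.standard_normal((n, k)) * 0.1
    mask = R != 0
    I = np.eye(k)
    for _ in range(iters):
        # Fix Y: solve one k-by-k normal-equation system per user.
        # These independent small solves are what a fine-grained GPU
        # parallelization maps onto threads/thread blocks.
        for u in range(m):
            Yu = Y[mask[u]]                      # factors of items rated by user u
            A = Yu.T @ Yu + lam * I              # regularized Gram matrix
            b = Yu.T @ R[u, mask[u]]
            X[u] = np.linalg.solve(A, b)
        # Fix X: solve one system per item, symmetrically.
        for i in range(n):
            Xi = X[mask[:, i]]
            A = Xi.T @ Xi + lam * I
            b = Xi.T @ R[mask[:, i], i]
            Y[i] = np.linalg.solve(A, b)
    return X, Y
```

Because each user's (and each item's) solve repeatedly gathers rows of the other factor matrix, the access pattern is irregular; this is where a blocked sparse storage format with data reuse and reordering, as in contribution 3, pays off.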
Keywords/Search Tags: Alternating least squares, Portability, Data tiling, Data reuse, Data reordering