Accelerating Address Translation On Neural Network Processors For Click-Through Rate Estimation

Posted on: 2024-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: F Ding
Full Text: PDF
GTID: 2568306929490694
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of deep learning, deep learning-based recommender systems play an increasingly important role on Internet platforms such as online advertising, web search, and e-commerce. In these systems, the click-through rate prediction algorithm is a core component: it affects the display order of recommended items and reflects the quality of the recommendations, so it has long attracted attention. However, inference and training for click-through rate prediction face challenges such as large-scale training data, highly sparse features, and frequent model updates. To cope with both these challenges and the end of Moore's Law, more and more AI vendors and Internet platform companies deploy recommender systems and click-through rate prediction algorithms on ASICs designed for deep learning (referred to as neural network processors, or "NPUs") to obtain orders-of-magnitude gains in speed and energy efficiency. Many problems nevertheless remain open.

This thesis focuses on the challenges that click-through rate prediction algorithms pose to the NPU's address translation mechanism when executing large-scale embedding layers. After an in-depth study of how click-through rate prediction algorithms behave on NPUs, it makes three contributions addressing the bottlenecks of the current NPU memory management unit (MMU):

1. We designed and developed an event-driven, cycle-accurate simulator called MMU-Sim. On top of MMU-Sim we built a complete model of the address translation subsystem, which provides a solid foundation for implementing and verifying the subsequent optimization schemes.

2. For the bursty and sparse address translation requests that occur while executing the click-through rate prediction algorithm, we propose two schemes designed for high throughput: a multi-stream TLB design and a merging buffer. Compared with the baseline, the two schemes achieve average speedups of 3 times and about 5 times, respectively, and reduce address translation latency by 70%-90%.

3. For the long-latency problem that arises when executing the click-through rate prediction algorithm, we propose two schemes designed for low latency: a page-table cache and a secondary shared TLB. Compared with the baseline, the two schemes respectively achieve an average speedup of about 2.8 times and reduce the average address translation latency by 90%.

In conclusion, this thesis studies current mainstream recommender systems and click-through rate prediction algorithms and summarizes their general architecture and address translation behavior on NPUs, in order to address bursty address translation requests, highly sparse memory access patterns, and long translation latency. With the help of the MMU-Sim simulator, we propose a novel NPU MMU architecture with higher throughput, better performance, and lower latency. Experiments show that for large-scale embedding lookups, the novel MMU achieves a 15.37 times speedup over the mainstream MMU and reduces average latency by 93.82%, with nearly 100% area overhead.
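To make the merging-buffer idea in contribution 2 more concrete, the following is a minimal Python sketch of coalescing translation requests that fall on the same virtual page, so that each page needs only one TLB lookup or page walk. It is an illustration under stated assumptions, not the design evaluated in the thesis: the function name `coalesce_by_page`, the 4 KiB page size, the buffer capacity, the oldest-first eviction policy, and the example addresses are all hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes; the abstract does not state one


def coalesce_by_page(virtual_addresses, capacity=8):
    """Group requests by virtual page number (VPN) in a small bounded buffer.

    Returns a list of (vpn, [addresses]) batches in the order they leave the
    buffer. Each batch would need only one translation.
    """
    buffer = OrderedDict()  # vpn -> addresses waiting on that page
    batches = []
    for va in virtual_addresses:
        vpn = va // PAGE_SIZE
        if vpn in buffer:
            buffer[vpn].append(va)  # merge with a pending request for this page
            continue
        if len(buffer) == capacity:
            # Buffer full: release the oldest entry to make room (assumed policy).
            batches.append(buffer.popitem(last=False))
        buffer[vpn] = [va]
    # Drain whatever is still pending at the end of the burst.
    batches.extend(buffer.items())
    return batches


if __name__ == "__main__":
    # Hypothetical embedding-row addresses: sparse, with some rows sharing a page.
    addrs = [0x1000, 0x1040, 0x9000, 0x1080, 0x9010, 0x20000]
    batches = coalesce_by_page(addrs)
    print(len(addrs), "requests ->", len(batches), "translations")
```

In this toy burst, six requests touch only three distinct pages, so merging halves the number of translations. How much a real merging buffer helps depends on how often embedding rows share a page, which the abstract does not quantify.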
Keywords/Search Tags: recommender systems, click-through rate, address translation, memory management method, deep learning