A Study On Spark-based Distributed Collaborative Filtering And Its Tools

Posted on: 2018-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhao
Full Text: PDF
GTID: 2348330512998640
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Mobile Internet and the Internet of Things, the amount of data collected by human beings is growing exponentially, and distributed computing has become an indispensable technology for big data processing and analysis. By decomposing a complex task into multiple subproblems that can be executed concurrently on multiple interconnected nodes, distributed computing overcomes the single-node bottleneck and poor scalability of traditional algorithms. As a result, distributed machine learning algorithms have become a research focus in both industry and academia.

Among the many distributed computing frameworks, Spark is widely used because of its fault tolerance, scalability, and ease of use. However, the analysis and comparison of the complexity of distributed algorithms still lack a unified framework, so the scalability and performance of specific algorithms on the Spark platform can only be evaluated empirically.

Based on a study of the Spark distributed platform, this paper proposes a framework for analyzing the complexity of distributed algorithms on Spark, using Spark-based collaborative filtering as the application scenario. The results show that the framework can effectively guide algorithm development and runtime environment configuration. Specifically, this paper makes the following contributions:

Firstly, this paper introduces distributed computing and collaborative filtering technology. The distributed computing section gives a detailed account of the computing models, execution models, and design concepts of the popular Hadoop and Spark distributed computing platforms, together with an analysis and explanation of their principles. The collaborative filtering section analyzes memory-based collaborative filtering and collaborative filtering based on matrix factorization, and introduces a variety of classical algorithms.

Then, this paper proposes a complexity analysis framework for distributed algorithms on Spark, and uses it to analyze a variety of Spark-based distributed collaborative filtering algorithms.

Finally, this paper designs a Spark-based data mining toolbox. By packaging data mining algorithms as configurable components and providing a configuration-based development model for data analysis applications, the toolbox addresses the difficulty analysts face in using Spark directly. With this toolbox, users without programming skills can easily apply various distributed data mining algorithms to large amounts of data. The functionality and development process of the toolbox are described in detail.
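To make the matrix-factorization branch of collaborative filtering concrete, the following is a minimal sketch using Spark MLlib's ALS (alternating least squares) estimator, one standard way to implement this technique on Spark; the thesis does not specify its exact algorithms, and the toy rating triples, application name, and parameter values here are illustrative assumptions only:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cf-als-sketch") // hypothetical application name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy (user, item, rating) triples standing in for a real ratings dataset.
    val ratings = Seq(
      (0, 0, 4.0f), (0, 1, 2.0f),
      (1, 1, 3.0f), (1, 2, 5.0f),
      (2, 0, 1.0f), (2, 2, 4.0f)
    ).toDF("userId", "itemId", "rating")

    // Matrix factorization: learn rank-k latent factor vectors for users
    // and items by alternating least squares; all parameters are examples.
    val als = new ALS()
      .setRank(10)
      .setMaxIter(10)
      .setRegParam(0.1)
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("rating")

    val model = als.fit(ratings)

    // A predicted rating is the dot product of a user factor and an item factor.
    model.transform(ratings).show()

    spark.stop()
  }
}
```

Because ALS distributes the user and item factor matrices across the cluster and solves each least-squares subproblem in parallel, it is a natural subject for the kind of per-algorithm complexity analysis the proposed framework targets.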
Keywords: Spark, Collaborative Filtering, Distributed Computing