| In recent years, the rapid development of Internet technology and smart mobile devices such as mobile phones has caused the amount of data people generate to grow exponentially, leaving a huge volume of data on the network. In this big data environment, traditional recommendation algorithms face severe challenges such as low execution efficiency, data sparsity, and poor scalability, and traditional recommendation systems struggle to store, analyze, and manage such massive data efficiently. Solving or alleviating these problems calls for new platform architectures and better-suited recommendation algorithms. To address the shortcomings of current research on recommendation algorithms, this thesis studies the mainstream big data parallel computing frameworks and recommendation algorithms in depth, then combines clustering with distributed parallel computing to propose a Spark-based optimization scheme for matrix factorization collaborative filtering recommendation. The main research work is as follows: 1. A clustering collaborative filtering recommendation algorithm is proposed that combines the Canopy-KMeans clustering algorithm with ALS-based matrix factorization. First, a user feature preference matrix is constructed from user-item interaction data; the user feature preference vectors are then clustered with the improved Canopy-KMeans algorithm to reduce the dimension and sparsity of the matrix; finally, an ALS matrix factorization prediction model is built for each cluster. This method targets the data sparsity caused by excessive data volume in the big data setting and thereby aims to improve the accuracy of the algorithm. The method is also well suited to parallel computing, so a distributed parallel computing framework can further improve its efficiency. 2. The improved recommendation algorithm is combined with the Spark distributed computing framework and other related big data technologies, which effectively improves its computing efficiency and scalability and allows it to be integrated with Hadoop. Hadoop’s distributed file system (HDFS) stores and manages the massive data files, and the improved algorithm is implemented in parallel on the Spark platform. For the experimental evaluation, this thesis uses multiple MovieLens datasets. To assess recommendation quality, comparative experiments are conducted on several metrics: mean absolute error (MAE), root mean square error (RMSE), mean square error (MSE), accuracy, and normalized discounted cumulative gain (NDCG). The results show that the improved algorithm significantly improves the fit of the recommendation model, the accuracy of the recommendations, and the quality of the Top-N recommendation list. To assess execution efficiency, comparative experiments measure running time, speedup, scaleup, and other metrics across datasets of different sizes and clusters of different sizes. The results show that the Spark-based implementation significantly improves the efficiency and scalability of the algorithm. |
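
The clustering stage described in the abstract can be illustrated with a minimal single-machine sketch. It assumes the textbook Canopy procedure (only the tight threshold, here called `t2`) to choose the number of clusters and their initial centers, followed by plain Lloyd-style K-means. It is not the thesis's improved Canopy-KMeans variant, whose details the abstract does not give, and the function names, the threshold value, and the toy preference vectors are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2):
    """Textbook Canopy pass (assumed, not the thesis's improved variant).

    Repeatedly takes the first remaining point as a center and drops every
    remaining point within distance t2 of it, so the number of centers
    (and hence K) falls out of the data instead of being fixed up front.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if euclidean(center, p) > t2]
    return centers


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        for p in points
    ]


def kmeans(points, initial_centers, iterations=10):
    """Plain Lloyd iterations seeded with the Canopy centers."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # New center is the per-dimension mean of the members.
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Keep the old center if a cluster ends up empty.
                new_centers.append(old)
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical user feature preference vectors (two genres per user).
preferences = [(5.0, 1.0), (4.8, 1.2), (5.2, 0.9),
               (1.0, 5.0), (1.1, 4.7), (0.9, 5.2)]
seeds = canopy_centers(preferences, t2=2.0)
centers, labels = kmeans(preferences, seeds)
```

In the pipeline the abstract describes, each resulting user cluster would then get its own ALS matrix factorization model trained on that cluster's ratings. On Spark this step could use the MLlib `ALS` estimator, though that mapping is an assumption here rather than something the abstract states.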