The rapid development of the Internet has brought people a wealth of information, but as network data expands quickly, it has become very difficult for users to find, without delay, the information or services that match their own tastes. Recommendation algorithms arose in response: they predict the goods or services that users need by extracting information from user logs, and the recommendation system then recommends these goods or services to users. However, with the rapid growth of e-commerce and of user behavior logs in recent years, traditional recommendation algorithms can no longer produce recommendations quickly because of hardware constraints. The emergence of distributed frameworks provides an opportunity for the further development of recommendation algorithms. The Spark computing framework introduces the resilient distributed dataset (RDD) and a set of common operators, which make distributed computation more advantageous for algorithm implementation and model training. Nevertheless, there has been relatively little research, at home or abroad, on recommendation algorithms based on Spark. This thesis therefore studies and implements an improved recommendation algorithm and a parallel recommendation algorithm based on Spark. The work of this thesis mainly includes the following two parts:

1. The model-based ALS recommendation algorithm is improved, and a recommendation algorithm based on the topic model is proposed. The item feature file is modeled with an improved LDA algorithm, and the document-topic probability distribution is extracted. A KL-divergence measure is then proposed to calculate item similarity, and the high-similarity matrix of the items is obtained by applying a maximum threshold and a number of neighbors. Next, the high-similarity matrix and the original score matrix are combined to obtain predicted scores, which are written into the original score file to fill the training set. Finally, the ALS algorithm is used to train the model and predict the scores. Because the improved algorithm fills the original data set with information from the item file, it addresses the item cold-start problem and alleviates data sparsity. The experimental results show that the improved algorithm has a lower prediction error than ALS and other related collaborative filtering algorithms.

2. A parallel recommendation algorithm is proposed, based on the Spark distributed dataset (RDD) and Spark operators. First, this thesis analyzes the source code of the Spark LDA algorithm and the Spark ALS algorithm and demonstrates that parallelization is feasible. The Spark LDA algorithm and the Spark ALS algorithm are then integrated using the operators of the Spark computing framework. The movie review information crawled for the film feature file is segmented into words in parallel and input into the Spark LDA algorithm to obtain the document-topic distribution RDD. The predicted-score RDD is obtained by combining this RDD with the original score RDD through the cartesian and join operators, and the union operator is then used to merge the predicted-score RDD with the original score RDD into the training-set RDD. Finally, the training-set RDD is input into the Spark ALS algorithm to train the model in parallel. Experiments measuring MAE show that the prediction accuracy of the improved parallel algorithm is higher than that of the Spark ALS algorithm, and that the improvement is most pronounced when similarity is measured by KL divergence. Experiments comparing the multi-node parallel algorithm with the single-node serial algorithm show that the improved parallel algorithm parallelizes well on large volumes of data and, to some extent, solves the problem of time efficiency.
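
The filling step in the first part of the work (topic distributions, KL-divergence similarity, neighbor selection, predicted scores) can be illustrated with a minimal single-machine sketch. It is not the thesis implementation. Several details are assumptions made only for illustration: the symmetrized KL divergence, the mapping `1 / (1 + D)` from divergence to similarity, the similarity-weighted average used as the predicted score, and all function names and toy data.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete topic distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def similarity(p, q):
    """Assumed mapping: symmetrized KL divergence turned into a (0, 1] score."""
    d = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return 1.0 / (1.0 + d)


def top_neighbors(topics, threshold, k):
    """Keep, per item, at most k neighbors whose similarity is >= threshold.

    topics: dict mapping item -> document-topic distribution (list of floats).
    Returns a dict mapping item -> list of (neighbor, similarity).
    """
    neighbors = {}
    for i, p in topics.items():
        sims = [(j, similarity(p, q)) for j, q in topics.items() if j != i]
        sims = [(j, s) for j, s in sims if s >= threshold]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        neighbors[i] = sims[:k]
    return neighbors


def fill_ratings(ratings, neighbors):
    """Add predicted scores for unrated (user, item) pairs.

    ratings: dict mapping (user, item) -> score.
    A missing score is predicted (assumption) as the similarity-weighted
    average of the user's scores on the item's high-similarity neighbors.
    """
    users = {u for u, _ in ratings}
    filled = dict(ratings)
    for u in users:
        for i, nbrs in neighbors.items():
            if (u, i) in ratings:
                continue
            rated = [(s, ratings[(u, j)]) for j, s in nbrs if (u, j) in ratings]
            if rated:
                total = sum(s for s, _ in rated)
                filled[(u, i)] = sum(s * r for s, r in rated) / total
    return filled


# Toy data: items A and B share a dominant topic, item C does not.
topics = {"A": [0.9, 0.1], "B": [0.85, 0.15], "C": [0.1, 0.9]}
ratings = {("u1", "A"): 5.0, ("u1", "C"): 1.0}

neighbors = top_neighbors(topics, threshold=0.9, k=1)
training_set = fill_ratings(ratings, neighbors)
print(training_set)
```

In this toy run, item B has no scores of its own, so it receives a predicted score from its only high-similarity neighbor A. The filled dictionary plays the role of the enlarged training set that would then be passed to ALS.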