| The spread of information technology has made information resources widely accessible. However, with the rapid development of the Internet, the amount of data generated is also growing rapidly, and the problem of "information overload" has emerged, making it difficult for users to find the information they need. At present, the mainstream approach to this problem is the personalized recommendation system, which uses a recommendation algorithm to proactively recommend information that may meet a user's needs, based on the user's history, even when those needs are unclear. The core of a recommendation system is its recommendation algorithm, and collaborative filtering is the most widely used. It is simple and efficient, but it suffers from problems such as a sparse user-item rating matrix, poor real-time performance, and cold start. This thesis introduces clustering into recommendation algorithms to address these problems and achieves good results. The specific work is as follows: (1) Traditional user-based collaborative filtering loses accuracy and time efficiency because the rating matrix is sparse and the nearest-neighbor search is slow. To address this, a collaborative filtering recommendation algorithm, PMCF, that integrates PCA dimensionality reduction and Mean shift clustering is designed. The algorithm uses principal component analysis (PCA) to retain the dimensions that best represent user interests, which alleviates the sparsity of the rating matrix; it then uses the Mean shift clustering algorithm to cluster users in the reduced low-dimensional vector space, which narrows the nearest-neighbor search range for the target user. To improve the real-time performance of recommendations, the PMCF algorithm has been parallelized and implemented on 
the Spark platform. (2) To address the user cold-start problem, an improved algorithm, IPMCF, is designed by adding user demographic information and a further clustering step to PMCF. After PMCF completes the Mean shift clustering of users, IPMCF further divides each cluster into multiple subcategories according to the users' basic attributes, and takes the most frequent (modal) value of each basic attribute as the basic attribute label of the subcategory. For a new target user, the algorithm computes the cosine similarity between the user's basic attribute labels and those of each subcategory in each cluster, and takes all users in the most similar subcategory of each cluster as the new user's nearest neighbors. It then removes from the neighbors' favorite items those whose number of ratings is below a threshold, sorts the remaining items in descending order of the neighbors' average rating, and recommends the top N items to the new user. To improve the real-time performance of recommendations, this thesis also parallelizes the IPMCF algorithm on the Spark platform. Experimental results on the MovieLens dataset and the HetRec2011-MovieLens-2k dataset show that the PMCF algorithm effectively improves the accuracy of recommendation results while maintaining high time efficiency; that the IPMCF algorithm, building on PMCF, further addresses the user cold-start problem; and that the parallelized PMCF and IPMCF algorithms have a significant time-efficiency advantage as the data size increases, which improves the real-time performance of recommendations. |
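
The cold-start step described for IPMCF can be illustrated with a small single-machine sketch. It is not the thesis implementation, and several details are assumptions made only for illustration: basic attributes are encoded as numeric tuples, a "favorite" item is one a neighbor rated at or above a hypothetical `like` threshold, the rating count is taken over the selected neighbors only, and the clusters and subcategories are given as input rather than produced by PCA and Mean shift. The function and parameter names (`recommend_for_new_user`, `min_count`, `like`) are likewise hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def subcategory_label(members, attrs):
    """Label a subcategory with the modal value of each basic attribute."""
    columns = zip(*(attrs[u] for u in members))
    return tuple(Counter(col).most_common(1)[0][0] for col in columns)


def recommend_for_new_user(new_attrs, clusters, attrs, ratings,
                           top_n=2, min_count=2, like=4.0):
    """clusters: list of clusters; each cluster is a list of subcategories;
    each subcategory is a list of user ids (assumed precomputed)."""
    # 1. In every cluster, pick the subcategory whose label is most similar
    #    to the new user's attributes; its members become neighbors.
    neighbors = []
    for subcats in clusters:
        best = max(subcats,
                   key=lambda m: cosine(new_attrs, subcategory_label(m, attrs)))
        neighbors.extend(best)

    # 2. Collect the neighbors' favorite items (assumption: rating >= like).
    favorite_ratings = defaultdict(list)
    for user in neighbors:
        for item, rating in ratings.get(user, {}).items():
            if rating >= like:
                favorite_ratings[item].append(rating)

    # 3. Drop items rated by too few neighbors, then rank by average rating.
    avg = {item: sum(rs) / len(rs)
           for item, rs in favorite_ratings.items() if len(rs) >= min_count}
    return sorted(avg, key=lambda item: (-avg[item], item))[:top_n]


# Toy data: attributes are (age bucket, gender code, occupation code).
attrs = {"u1": (1, 0, 3), "u2": (1, 0, 3), "u3": (2, 1, 5),
         "u4": (0, 1, 1), "u5": (1, 0, 2), "u6": (1, 0, 2)}
clusters = [[["u1", "u2"], ["u3"]],
            [["u4"], ["u5", "u6"]]]
ratings = {"u1": {"A": 5, "B": 4, "C": 2}, "u2": {"A": 4, "B": 5, "D": 5},
           "u3": {"E": 5}, "u4": {"E": 5},
           "u5": {"A": 5, "C": 5}, "u6": {"B": 4, "D": 3}}

recs = recommend_for_new_user((1, 0, 3), clusters, attrs, ratings)
print(recs)  # ['A', 'B']
```

In this toy run the new user's neighbors are u1, u2, u5, and u6. Items C and D are each a favorite of only one neighbor, so they fall below `min_count` and are dropped before ranking.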