The rapid development of Internet technology has greatly lowered the barrier to obtaining information. At the same time, the Internet continuously generates huge amounts of data, and human society has entered an era of information overload. As an effective means of addressing information overload, a personalized recommendation system based on collaborative filtering can extract the information that users are interested in from massive data by analyzing their historical behavior. It not only helps people obtain valuable information more efficiently, but also places information in front of the users who are interested in it, bringing substantial economic benefits to enterprises. In practice, traditional collaborative filtering recommendation systems suffer from data sparsity and cold start, which lower both the accuracy and the running efficiency of the system. To tackle these problems, we propose a recommendation system based on user ratings and reviews that uses big data preprocessing technology and is designed and implemented on Spark. The major work of this thesis is as follows:

(1) We propose an item feature extraction method based on big data preprocessing. First, we preprocess the review data; this process includes data aggregation, missing value imputation, deduplication and format conversion. We then extract item features from the preprocessed data using the Word2Vec model, which yields good results.

(2) We propose a recommendation algorithm based on user ratings and reviews. First, we use word frequency, ratings, review time and the helpfulness of reviews to improve the item features, and then calculate the similarity between two items from these features. Finally, a recommendation list is generated from the predicted user ratings of items, which are derived from the previously calculated similarities.

(3) We design and implement the recommendation system on Spark. First, the recommendation engine is divided into two modules, online computing and offline computing, to achieve parallel execution of the algorithm. We then validate the effectiveness and scalability of the recommendation system on Amazon product data.

(4) The experimental results show that, compared with many traditional collaborative filtering recommendation algorithms, our algorithm has an advantage in mean absolute error (MAE), and its prediction accuracy is much better than that of the item-based collaborative filtering algorithm. Overall, our method performs better than the traditional collaborative filtering algorithms and effectively alleviates the problems of data sparsity and cold start.
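
The rating prediction described in (2) and the MAE metric used in (4) can be illustrated with a small sketch. This is not the thesis's implementation: it assumes cosine similarity over item feature vectors and a similarity-weighted average of the user's known ratings, runs on a single machine rather than on Spark, and all function names, item IDs and numbers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def predict_rating(target_item, user_ratings, item_features):
    """Predict a user's rating of target_item as the similarity-weighted
    average of that user's ratings of other items (assumed scheme).

    Returns None when no rated item has positive similarity to the target.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for item, rating in user_ratings.items():
        if item == target_item:
            continue
        sim = cosine_similarity(item_features[target_item], item_features[item])
        if sim > 0.0:
            weighted_sum += sim * rating
            weight_total += sim
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total


def mean_absolute_error(predicted, actual):
    """MAE: average absolute difference between predicted and true ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical toy data: in the thesis the feature vectors come from
# Word2Vec over reviews; here they are hand-written two-dimensional vectors.
item_features = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.0],
    "C": [0.0, 1.0],
}
user_ratings = {"B": 4.0, "C": 2.0}

prediction = predict_rating("A", user_ratings, item_features)
error = mean_absolute_error([4.0, 3.0], [5.0, 3.0])
```

In this toy case item A is identical to B and orthogonal to C, so only the rating of B contributes to the prediction. Because the prediction depends on item features rather than on co-rating counts, an item with few ratings can still be compared with others, which is the intuition behind using review-derived features against sparsity and cold start.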