
Design And Implementation Of Computing And Storage Platform For Big Data Recommendation

Posted on: 2017-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: C X Li
Full Text: PDF
GTID: 2348330509957115
Subject: Computer technology
Abstract/Summary:
With the arrival of the big data era, the volume of information is growing rapidly. In an Internet environment overloaded with information, it is increasingly difficult for users to find the information they are interested in, so helping users reach that information as efficiently as possible has become an active research topic in big data. A recommendation system offers one solution: it is an active information push system in which message producers push messages to suitable consumers, using predictions of user preferences derived from large volumes of historical records. Because those historical records are the basis for predicting preferences, a recommendation system must be able to process big data.

Hadoop, the first-generation big data processing framework, was a popular choice in recommendation system architectures. However, as recommendation algorithms have advanced, Hadoop's computation model, MapReduce, has found it increasingly difficult to meet their performance requirements. Apache Spark, a newer-generation big data processing framework that has become popular in recent years, is well suited to recommendation algorithms, and implementing offline and online recommendation algorithms on Spark can effectively improve system performance.

To provide solutions for big data recommendation systems, this paper studies the two levels of service such a system requires: computing services and storage services. At the computing level, we implement offline and online recommendation algorithms and optimize their computing efficiency. At the storage level, we introduce an erasure code storage method into HDFS and build a MongoDB service for data storage. The specific contributions are as follows.

(1) This paper implements an offline recommendation algorithm based on ALS in the Spark programming model and optimizes Spark's scheduling strategy.
This paper found that when the offline algorithm runs on a heterogeneous cluster, Spark's original scheduling strategy is not rational. We therefore optimized the original scheduler and proposed a new strategy: a scheduling policy based on node resource priority. Experimental results showed that the new strategy is more efficient than the original one when running offline algorithms.

(2) This paper proposes a fast real-time recommendation algorithm, implements it in the Spark Streaming programming model, and optimizes it for Spark Streaming. The algorithm is designed around the idea that "user preferences will change over time": it uses only recent ratings to build the prediction model for each user, which greatly reduces the amount of data to process. In the Spark Streaming implementation, we increased the parallelism of data processing by setting up multiple message receivers and repartitioning the data, and we used Spark's broadcast mechanism to reduce algorithm execution time. Experimental results showed that the algorithm has good real-time data processing capability.

(3) This paper also proposes a new erasure-code-based HDFS storage policy, “Redundancy First, Then Encode”, and designs a new HDFS architecture, called “HDFS with Erasure Code”, to support it. The redundant storage policy wastes a great deal of disk space but makes data easy to recover, whereas the erasure code storage policy saves disk space but makes data harder to recover and lowers the efficiency of big data computing. Under the new policy, data that has not been used for a long period is stored with the erasure code method, and all other data is stored with the redundancy method.
In this way, the cloud storage system strikes a balance between wasted disk space and the difficulty of data recovery, and provides better big data storage services for the recommendation system.
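
The storage policy described in (3) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the 30-day idle threshold, the 3-replica setting, the 6 data + 3 parity coding layout, and the names `BlockInfo`, `choose_policy`, and `stored_bytes` are all assumptions made here for illustration, since the abstract does not give concrete parameters.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class BlockInfo:
    """Hypothetical per-block metadata used only for this sketch."""
    block_id: str
    size_bytes: int
    last_access_ts: float  # seconds since the epoch


def choose_policy(block, now_ts, cold_after_days=30.0):
    """Pick a storage method from how long the block has been idle.

    The 30-day threshold is an assumed value, not one from the thesis.
    """
    idle_days = (now_ts - block.last_access_ts) / SECONDS_PER_DAY
    return "erasure" if idle_days > cold_after_days else "replication"


def stored_bytes(block, policy, replicas=3, data_units=6, parity_units=3):
    """Raw disk bytes consumed under each method (padding ignored).

    Replication stores `replicas` full copies; erasure coding stores the
    data plus parity, i.e. (data_units + parity_units) / data_units times
    the original size. All three parameters are assumed defaults.
    """
    if policy == "replication":
        return block.size_bytes * replicas
    return block.size_bytes * (data_units + parity_units) // data_units
```

Under these assumed parameters, a block that has gone cold drops from three times its size on disk to one and a half times, which is the space saving the policy trades against slower recovery.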
Keywords/Search Tags: big data, Spark, recommendation system, batch recommendation, real-time recommendation, erasure code storage