Design And Implementation Of A Distributed Data Processing Platform For The Online Music Service

Posted on: 2015-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: H Deng
GTID: 2298330422977159
Subject: Software engineering
Abstract/Summary:
With the rapid development of the internet and the growing popularity of mobile clients, all kinds of data are growing exponentially, and extracting useful information from massive data has become a hot topic. The play records generated by online music services are a typical example of such data. As online music has developed, people listen to music through all kinds of players, and the service provider keeps every user's play record intact. By exploring these play records, we can discover the music preferences of each category of users. However, current data processing platforms are not adequate for this demand, so mining information from massive user play records remains a major challenge.

To extract useful information from massive play records, this dissertation proposes and implements KGMiner, a distributed data processing platform for online music services. It focuses on processing Kugou Music data, including preprocessing, clustering, and hot item extraction. The dissertation also defines and abstracts the standard procedures for preprocessing and clustering, which makes it convenient for data analysts to develop extensions for different demands. KGMiner is built on Hadoop, a mature big data processing framework, and is expected to carry out the data mining of user play records from Kugou Music.

In practical use, however, I noticed several deficiencies in the iterative computation of the Hadoop-based distributed K-means algorithm: random selection of initial points, lengthy job start-up time, long reduce time, and so on. The improvements in this work therefore focus on optimizing the efficiency of the iterative computation of distributed K-means.

The improvement work is divided into three parts. First, I revise the random selection of initial points in K-means: drawing on ideas from K-means++, I select mutually distant points as initial points to reduce the number of iterations. Second, having observed that the jobs run serially, I propose an asynchronous start method that reduces the share of processing time spent on job start-up. Finally, because most of the reduce time is spent booting the framework on the reduce side rather than on computation, I implement a new reduce operating mode called MyReduce, which keeps receiving data and calculating the global centers, and thereby avoids the framework overhead incurred when computing the global centers of gravity. Experimental results on real Kugou Music data suggest that these improvements substantially reduce the total time of distributed clustering compared with the baseline distributed K-means clustering.
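As a rough illustration of two of the ideas in the abstract, the sketch below is a single-machine Python example and not the thesis's Hadoop code. It assumes a farthest-first rule for choosing distant initial points (a deterministic relative of K-means++ seeding; the abstract does not state the exact selection rule) and models the MyReduce idea as an object that keeps merging partial (sum, count) pairs and can report global centers at any time. The names `farthest_first_seeds` and `CenterAccumulator` are hypothetical.

```python
import math
from collections import defaultdict


def farthest_first_seeds(points, k, first=0):
    """Pick k seeds: start at points[first], then repeatedly take the point
    whose distance to its nearest already-chosen seed is largest.
    (Assumed rule; the thesis only says it selects distant points.)"""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    seeds = [points[first]]
    # nearest[i] = distance from points[i] to the closest seed chosen so far
    nearest = [math.dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        seeds.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return seeds


class CenterAccumulator:
    """Hypothetical stand-in for a long-lived reducer: merges partial
    (sum, count) pairs as they arrive and reports centers on demand."""

    def __init__(self):
        self._sums = {}                   # cluster id -> coordinate sums
        self._counts = defaultdict(int)   # cluster id -> number of points

    def add(self, cluster_id, partial_sum, count):
        """Merge one partial result, e.g. the output of one map task."""
        if cluster_id in self._sums:
            self._sums[cluster_id] = [
                a + b for a, b in zip(self._sums[cluster_id], partial_sum)
            ]
        else:
            self._sums[cluster_id] = list(partial_sum)
        self._counts[cluster_id] += count

    def centers(self):
        """Current global center (mean) of every cluster seen so far."""
        return {
            cid: tuple(s / self._counts[cid] for s in total)
            for cid, total in self._sums.items()
        }
```

Because each partial result is only a coordinate sum and a count, the merge is associative and order-independent, which is what would let a continuously running reducer update the centers as data arrives instead of waiting for a fresh reduce job to start.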
Keywords/Search Tags: Big Data, Online Music Service, Data Processing, Hadoop, Process Optimization