Design And Implementation Of A Distributed Data Processing Platform For The Online Music Service

Posted on:2015-02-08

Degree:Master

Type:Thesis

Country:China

Candidate:H Deng

Full Text:PDF

GTID:2298330422977159

Subject:Software engineering

Abstract/Summary:

As the internet develops rapidly and mobile clients gain popularity, data of all kinds grows exponentially, and extracting useful information from massive data has become a hot topic. The play records generated by online music services are a typical example of such data. As online music has developed, people listen to music through many kinds of players, and every user's play records are preserved intact by the service provider. Mining these records can reveal the music preferences of each category of users. However, current data processing platforms are not adequate for this demand, so exploiting the information in massive play records remains a great challenge.

To extract useful information from massive play records, this thesis proposes and implements KGMiner, a distributed data processing platform for online music services. It focuses on processing Kugou Music data, including preprocessing, clustering, and hot-item extraction. The thesis also defines and abstracts the standard procedures for preprocessing and clustering, which makes it convenient for data analysts to develop extensions for different demands. KGMiner is built on the Hadoop big data processing framework and is intended to carry out the data mining of Kugou Music users' play records.

In practical application, however, I noticed several deficiencies in the iterative computation of the Hadoop-based distributed K-means algorithm, such as randomly selected initial points, lengthy job start-up time, and long reduce time. The improvements in this thesis therefore focus on optimizing the efficiency of that iterative computation.

The improvement work is divided into three parts. First, I revise the random selection of initial points in K-means: borrowing ideas from K-means++, I select mutually distant points as the initial points to reduce the number of iterations. Second, having observed that the jobs run serially, I propose an asynchronous start method that reduces the share of processing time spent on job start-up. Finally, because most of the reduce time is spent on the framework's start-up of the reduce side rather than on computation, I implement a new reduce mode called MyReduce, which continuously receives data and computes the global centers, avoiding the framework overhead incurred when computing the global centers. Experimental results on real Kugou Music data suggest that the improved method substantially reduces the total time of distributed clustering compared with the baseline distributed K-means clustering.
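
The abstract does not give the implementation, so the following is only a minimal single-machine sketch of two of the ideas above, not KGMiner's code. It assumes a deterministic farthest-point rule for choosing distant initial points (K-means++ itself samples points with probability proportional to squared distance), and it models the idea behind MyReduce as an accumulator that folds in partial sums as they arrive. The names `pick_distant_seeds` and `RunningCenters` are hypothetical.

```python
import random
from typing import List, Optional, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pick_distant_seeds(points: List[Point], k: int, rng: random.Random) -> List[Point]:
    """Farthest-point seeding (an assumed stand-in for the thesis's method).

    The first seed is chosen at random; each later seed is the point whose
    distance to its nearest already-chosen seed is largest.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    seeds = [points[rng.randrange(len(points))]]
    # nearest[i] holds the squared distance from points[i] to its closest seed.
    nearest = [squared_distance(p, seeds[0]) for p in points]
    while len(seeds) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        seeds.append(points[idx])
        nearest = [
            min(d, squared_distance(p, points[idx]))
            for d, p in zip(nearest, points)
        ]
    return seeds


class RunningCenters:
    """Hypothetical accumulator: keeps receiving per-cluster partial sums
    and can report the current global centers at any time."""

    def __init__(self, k: int, dim: int) -> None:
        self.sums = [[0.0] * dim for _ in range(k)]
        self.counts = [0] * k

    def receive(self, cluster_id: int, partial_sum: Point, count: int) -> None:
        """Fold one partial result (coordinate sums and point count) into the totals."""
        for i, value in enumerate(partial_sum):
            self.sums[cluster_id][i] += value
        self.counts[cluster_id] += count

    def centers(self) -> List[Optional[List[float]]]:
        """Mean of each cluster seen so far; None for clusters with no points."""
        return [
            [s / c for s in row] if c else None
            for row, c in zip(self.sums, self.counts)
        ]
```

In this sketch, each mapper would send a `(cluster_id, partial_sum, count)` triple per cluster, and the receiver can recompute the global centers without waiting for a separate reduce phase to start. How KGMiner actually transports and schedules this data is not stated in the abstract.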

Keywords/Search Tags:

Big Data, Online Music Service, Data Processing, Hadoop, Process Optimization

Related items

1	Design And Implementation Of Online Data Processing System Based On Hadoop
2	Research And Implementation Of Marine Information OLAP And Data Mining System Based On Hadoop
3	Key Technologies On Structural Feature Based Music Resizing
4	Research On Key Technologies Of Massive Network Data Processing Platform Based On Hadoop
5	Platform Development On Massive Data Collection And Processing Based On Hadoop
6	Research On Big Data Processing System Based On MapReduce Parallel Processing Framework
7	Key Technology Research-based The Hadoop Of Massive Data Processing
8	Research And Application On Big Data Processing Based On Hadoop Platform
9	Design And Implementation Of Big Data Processing Platform Based On Hadoop
10	Research And Application Of Data Processing Based On Hadoop