Kmeans Analysis Of Massive Book Circulation Data Based On Hadoop

Posted on:2016-04-11

Degree:Master

Type:Thesis

Country:China

Candidate:X L Liu

Full Text:PDF

GTID:2308330479995242

Subject:Electronic and communication engineering

Abstract/Summary:

With the advent of the data era, we are confronted with data of many kinds at all times. However, these vast volumes of data are disorganized and hard to make sense of, so finding the regularities and valuable information behind them in a timely and effective way has become particularly important. Clustering analysis is an important technique in data mining: through clustering we can find the relationship between the distribution patterns of data and data attributes, and it has long been an active research topic in academia. In addition, the rise of Hadoop, a cloud computing platform, makes data mining faster and more efficient.

Building on a study of the kmeans clustering algorithm, this thesis proposes two improvements that address defects of the traditional algorithm. The first concerns the determination of the initial cluster centers: a method combining sampling with the maximum-minimum distance is proposed, together with a grid- and distance-based detection method for the outlier problem encountered during the research. The improved algorithm raises both execution efficiency and accuracy. The second improvement is a parallel design of the kmeans clustering algorithm on Hadoop, realized as the Map and Reduce processes of the MapReduce parallel programming model. Finally, the performance of the improved algorithm is evaluated by analyzing its speedup ratio and complexity; the improved kmeans clustering algorithm shows clear gains in both clustering quality and execution efficiency.

This thesis takes book circulation data as its object of study. Information on books and students is collected and standardized into clustering data. Clustering in different forms, based on number and reader type, shows that students differ in their book-borrowing tendencies and yields a great deal of other valuable information. The information gained through clustering offers useful guidance for book management and for students' learning.
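The two ideas named in the abstract, seeding from a sample with the maximum-minimum distance rule and splitting each kmeans iteration into a Map step and a Reduce step, can be illustrated with a minimal single-process sketch. This is not the thesis's implementation: the function names (`max_min_centers`, `map_phase`, `reduce_phase`, `kmeans`), the parameters, and the convergence test are assumptions made for illustration, the grid- and distance-based outlier detection is omitted, and no Hadoop API is used (the Map and Reduce steps are plain Python functions).

```python
import math
import random
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, k, sample_size, seed=0):
    """Pick k initial centers from a random sample (max-min distance rule)."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = [sample[0]]
    while len(centers) < k:
        # Take the sampled point whose nearest chosen center is farthest away.
        centers.append(max(sample, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def map_phase(points, centers):
    """Map: emit (index of nearest center, point) for every point."""
    for p in points:
        yield min(range(len(centers)), key=lambda i: dist(p, centers[i])), p


def reduce_phase(pairs, centers):
    """Reduce: average the points assigned to each center."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # a center with no points keeps its position
    for idx, members in groups.items():
        new_centers[idx] = tuple(sum(col) / len(members) for col in zip(*members))
    return new_centers


def kmeans(points, k, sample_size=100, max_iter=50, tol=1e-9):
    """Run Map/Reduce rounds until the centers stop moving (assumed criterion)."""
    centers = max_min_centers(points, k, sample_size)
    for _ in range(max_iter):
        new_centers = reduce_phase(map_phase(points, centers), centers)
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

On a real cluster, the Map step would run over data splits on separate nodes and the Reduce step would receive the pairs grouped by center index; the sketch only shows how the work divides between the two steps.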

Keywords/Search Tags:

kmeans, Hadoop, parallelization, isolated point, Grid

Related items

1	Improvement Of Collaborative Filtering Recommendation Algorithm And Its Parallelization On Hadoop Platform
2	Research On Parallelization Of Community Discovery Algorithm Based On Hadoop
3	Automatic Parallelization For Seismic Data Processing Programs On Grid Environment
4	Research On Map-Reduce Application Based On Hadoop
5	Research On Map-reduce Application Based On Hadoop
6	The Research Of Parallel Clustering Algorithm Based On Hadoop Platform
7	Parallelization For Serial Programs And Its Application On Desktop Grid
8	Parallel Implementation For Last Based On Hadoop Streaming
9	Research And Implementation Of Internet Public Opinion Analysis System Based On Hadoop
10	The Research Of Clustering Mining Based On Logistics History Data On The Hadoop