
Kmeans Analysis Of Massive Book Circulation Data Based On Hadoop

Posted on: 2016-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: X L Liu
Full Text: PDF
GTID: 2308330479995242
Subject: Electronic and communication engineering
Abstract/Summary:
In the era of big data, we are constantly confronted with data of many kinds. These data are vast and disorganized, which makes them hard to use directly, so finding the regularities and valuable information behind them in a timely and effective way has become particularly important. Cluster analysis is an important technique in data mining: it reveals the distribution patterns of data and the relationships among data attributes, and it has long been an active research topic in academia. In addition, the rise of Hadoop, a cloud computing platform, has made data mining faster and more efficient.

Based on a study of the kmeans clustering algorithm, this thesis proposes two improvements that address defects of the traditional algorithm. The first concerns the choice of the initial cluster centers: a method that combines sampling with the maximum-minimum distance criterion is proposed, together with a grid- and distance-based detection method for the outlier problem encountered during the research. The improved algorithm achieves better execution efficiency and accuracy. The second improvement is a parallel design of the kmeans clustering algorithm on Hadoop, implemented as the Map and Reduce processes of the MapReduce parallel programming model. Finally, the performance of the improved algorithm is evaluated by analyzing its speedup ratio and complexity; the results show a clear improvement in both clustering quality and execution efficiency.

The thesis takes book circulation data as its object of study. Information on books and students is collected and standardized into clustering data, and clustering is then carried out in different forms, based on number and on reader type. The results show that different students have different borrowing tendencies, and they reveal much other valuable information. The information gained through clustering offers useful guidance for book management and for students' learning.
Keywords/Search Tags: kmeans, Hadoop, parallelization, isolated point, grid
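
The two improvements named in the abstract can be illustrated with a minimal single-process sketch. It is not the thesis's implementation: the abstract gives no code, so every function name and parameter below (`maxmin_init`, `map_phase`, `reduce_phase`, `sample_size`, `tol`) is an assumption made for illustration. The Map and Reduce steps are plain Python functions that mimic the MapReduce model rather than calls to the Hadoop API, and the grid- and distance-based outlier detection is omitted.

```python
import math
import random
from collections import defaultdict


def maxmin_init(points, k, sample_size, seed=0):
    """Pick k initial centers from a random sample (max-min distance rule)."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = [sample[0]]
    while len(centers) < k:
        # Choose the sampled point whose distance to its nearest
        # already-chosen center is largest.
        nxt = max(sample, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def map_phase(points, centers):
    """Map: emit (index of nearest center, point) for every point."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        yield idx, p


def reduce_phase(pairs, centers):
    """Reduce: average the points under each key to get the new centers."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # a center with no points keeps its position
    for idx, members in groups.items():
        new_centers[idx] = tuple(sum(col) / len(members) for col in zip(*members))
    return new_centers


def kmeans(points, k, sample_size=100, max_iter=50, tol=1e-9):
    """Iterate Map and Reduce until the centers stop moving (assumed stop rule)."""
    centers = maxmin_init(points, k, sample_size)
    for _ in range(max_iter):
        new_centers = reduce_phase(map_phase(points, centers), centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers
```

Calling `kmeans(points, k)` on a list of equal-length numeric tuples returns `k` centers. In a real Hadoop job, each iteration would presumably run as one MapReduce pass, with the current centers distributed to every mapper.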