Research On Parallelization Of K-Means Clustering Algorithm Based On MapReduce

Posted on: 2016-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2208330470966818
Subject: Computer application technology
Abstract/Summary:
A clustering algorithm divides data into different classes or clusters. In business, it can help market analysts distinguish consumer groups in a customer database, such that members of the same group share similar behavior and preferences. Clustering algorithms mainly include partitioning methods such as k-means, hierarchical methods, and density-based methods. Traditional clustering algorithms can already solve the problem of clustering low-dimensional data. With the advent of the big data era, however, data is not only large in volume but also varied in type, which makes analysis more difficult. A distributed framework can improve the efficiency of these algorithms. Since 2008, the Hadoop distributed computing framework has allowed data mining algorithms to be migrated to a distributed platform, and the reliability and scalability of the MapReduce programming framework allow mining algorithms to handle large amounts of data. Web log mining recovers user behavior by analyzing logs. Running it on a distributed computing framework improves its performance, and its results can help personalize web services.

This paper uses MapReduce to process web logs on the Hadoop distributed framework and mainly does the following work.

1. Design of the HBase web log storage format: Web log formats differ from one another, and storing network logs is correspondingly complex. A network log collection system can get data directly from web log files, relational databases, and other sources. After simple processing, the data can be stored in HDFS and in the HBase distributed database. This paper designs a format for storing web logs in HBase.

2. Analysis of web logs based on MapReduce: MapReduce is a model for handling large data in a distributed environment. Using MapReduce, this paper analyzes website user behavior patterns and provides guidance for the design of website architecture.

3. Clustering algorithm based on MapReduce: This paper studies the k-means clustering algorithm and shows that it can be implemented on the distributed framework. In addition, the min-max algorithm can be used to determine the initial input k, which avoids blindly guessing the number of clusters k for a large volume of user data.

Based on the above study, this paper uses NASA site access log information as experimental data. The results show that MapReduce is effective for both the log analysis and the clustering algorithm.
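The abstract does not give the thesis's implementation, so the following is only a minimal sketch of how one k-means iteration is commonly expressed as a map step and a reduce step. It runs locally in plain Python and does not use the Hadoop API; the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`) are hypothetical, the initial centers are assumed to be supplied by the caller, and the min-max selection of k mentioned above is not implemented.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reduce step: average all points that share one center index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centers):
    """One MapReduce-style pass: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    # Assumption: a center that received no points keeps its old position.
    return [
        kmeans_reduce(groups[i]) if i in groups else centers[i]
        for i in range(len(centers))
    ]


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat iterations until no center moves by more than tol."""
    for _ in range(max_iter):
        new_centers = kmeans_iteration(points, centers)
        shift = max(dist(a, b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

For example, calling `kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` returns `[(0.0, 1.0), (10.0, 11.0)]`. On a real cluster the map calls would run in parallel over input splits and the grouping would be done by the framework's shuffle phase; each iteration would typically be a separate job that reads the centers written by the previous one.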
Keywords/Search Tags:Log mining, Web log, k-means, MapReduce, Clustering