
Research on Map-Reduce Applications Based on Hadoop

Posted on: 2010-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: R T Qiu
Full Text: PDF
GTID: 2178360308490759
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet and 3G, data has become diverse, massive, heterogeneous, and dynamically changing. Website operators often face the awkward situation of being "rich in data but poor in knowledge", so mining valid information from that data has become an important problem for researchers.

After analyzing related technologies, including distributed programming models, parallel computing, the MapReduce programming model, and Hadoop cluster technology, this thesis proposes a parallel programming framework for the MapReduce programming model based on the Hadoop platform. The framework uses open-source technology and currently popular distributed technologies to support the parallel execution of algorithms. Using it to improve the Canopy-Kmeans algorithm increased execution efficiency, and the framework can also be applied to a large number of other algorithms.

Canopy-Kmeans improves on the traditional Kmeans algorithm in two respects. First, it uses Canopy to select the K initial cluster centers, which can eliminate isolated points and improve clustering accuracy. Second, it uses a cheap, approximate distance measure to divide the data efficiently into overlapping subsets called canopies; clustering is then performed by measuring exact distances only between points that occur in a common canopy. This reduces computation time and improves efficiency.

The Hadoop platform has advantages such as low cost, easy maintenance, scalability, and ease of application development. The platform also allows users without experience in concurrent processing or distributed systems programming to handle large data resources, as long as they work with the corresponding Hadoop API. The MapReduce programming model makes distributed applications easy to write and simplifies distributed programming.

Finally, the algorithm is applied to website data processing and to student score statistics. The MapReduce model proves very efficient in these practical applications.
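The Canopy-Kmeans procedure summarized in the abstract can be illustrated with a minimal, single-process Python sketch. It is not the thesis's implementation and does not use the Hadoop API. The function names, the thresholds `t1` and `t2`, and the use of Manhattan distance as the "cheap" measure and Euclidean distance as the "exact" measure are all assumptions made for illustration. The assignment and update steps are written as map-like and reduce-like functions to suggest how they might be split across a MapReduce job. Removal of isolated points is omitted.

```python
import math
from collections import defaultdict

def manhattan(p, q):
    # Assumed stand-in for the cheap, approximate distance measure.
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    # Assumed exact distance measure.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_canopies(points, t1, t2):
    """Return a list of (center_index, member_indices); requires t1 >= t2 >= 0."""
    if not t1 >= t2 >= 0:
        raise ValueError("expected t1 >= t2 >= 0")
    candidates = list(range(len(points)))
    canopies = []
    while candidates:
        center = candidates[0]
        # Every point within the loose threshold t1 joins this canopy.
        members = [i for i in range(len(points))
                   if manhattan(points[center], points[i]) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold t2 can no longer start a canopy.
        candidates = [i for i in candidates
                      if manhattan(points[center], points[i]) > t2]
    return canopies

def map_assign(index, point, centers, point_canopies):
    # Map-like step: exact distances only to centers sharing a canopy.
    best = min(point_canopies[index],
               key=lambda cid: euclidean(point, centers[cid]))
    return best, point

def reduce_update(assigned_points):
    # Reduce-like step: the new center is the mean of the assigned points.
    n = len(assigned_points)
    return tuple(sum(col) / n for col in zip(*assigned_points))

def canopy_kmeans(points, t1, t2, iterations=10):
    canopies = build_canopies(points, t1, t2)
    # Canopy centers seed the initial cluster centers (one per canopy).
    centers = {cid: points[c] for cid, (c, _) in enumerate(canopies)}
    point_canopies = defaultdict(list)
    for cid, (_, members) in enumerate(canopies):
        for i in members:
            point_canopies[i].append(cid)
    assignment = {}
    for _ in range(iterations):
        groups = defaultdict(list)
        for i, p in enumerate(points):
            cid, _ = map_assign(i, p, centers, point_canopies)
            groups[cid].append(p)
            assignment[i] = cid
        new_centers = dict(centers)
        for cid, pts in groups.items():
            new_centers[cid] = reduce_update(pts)
        if new_centers == centers:  # stop once the centers no longer move
            break
        centers = new_centers
    return centers, assignment

# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = canopy_kmeans(points, t1=4, t2=2)
```

In this sketch each cluster center stays tied to the canopy that seeded it, so a point is compared only against the centers of the canopies it belongs to. On a real cluster, `map_assign` and `reduce_update` would correspond roughly to the mapper and reducer of one Kmeans iteration, with the current centers distributed to every mapper.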
Keywords/Search Tags: Hadoop, Map-Reduce, Canopy-Kmeans, Clustering, Distributed Computing