Research on Web Clustering Based on MapReduce

Posted on: 2012-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Yu
Full Text: PDF
GTID: 2218330368482096
Subject: Computer application technology
Abstract/Summary:
As network applications spread and the volume of online information grows rapidly, extracting useful knowledge from the flood of data becomes increasingly important. After a long period of research and exploration, data mining emerged as a comprehensive, interdisciplinary field whose techniques can effectively extract the knowledge users need. Cluster analysis is one of the most important components and basic tools of data mining. The rapid growth of data and the complexity of application development seriously hinder the exploitation of multi-core processors and multiprocessor systems, so the data cannot be used effectively. The classic approach is to build a distributed system with the Message Passing Interface (MPI), which exposes only fine-grained control to the programmer implementing a parallel application; the abstraction and complexity this approach demands are therefore beyond existing computing capability. The Map/Reduce programming framework offers a higher level of abstraction than MPI and suits many data-intensive batch-processing tasks, and the abstraction and complexity it demands do not exceed present computing capability. Building on a study of distributed computing and the Map/Reduce programming framework, this thesis improves the framework and analyzes the computing capability of the improved framework theoretically.
MRK-Means clusters by iterative calculation, performing multiple Map/Reduce operations. Combining the improved Map/Reduce programming framework with the characteristics of web data (massive, dynamic, and rapidly updated), the thesis further explores OMRK-Means, which aims to increase the scalability of online clustering, reduce clustering time, and improve clustering accuracy. With the implementation in place, experiments covering convergence and running time, precision, and scalability show that OMRK-Means is faster than the traditional clustering algorithm. The results indicate that the proposed method can speed up the interactive analysis of large data sets, improve clustering accuracy, and support scalable web mining when incremental data are processed in parallel.
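The idea of running k-means as repeated Map/Reduce passes can be illustrated with a minimal single-process sketch. It is not the thesis's MRK-Means or OMRK-Means implementation, whose details the abstract does not give; it only shows the generic pattern, assuming that the map step assigns each point to its nearest centroid and the reduce step averages each group. The names `map_phase`, `reduce_phase`, and `kmeans_mapreduce` are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: replace each centroid with the mean of its assigned points."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # a centroid with no points is left unchanged
    for idx, pts in groups.items():
        new_centroids[idx] = tuple(sum(coord) / len(pts) for coord in zip(*pts))
    return new_centroids


def kmeans_mapreduce(points, centroids, max_iter=20, tol=1e-9):
    """Run one map/reduce pass per iteration until the centroids stop moving."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans_mapreduce(data, [(0, 0), (10, 10)]))
```

In a real Map/Reduce deployment the map and reduce functions would run on separate workers and each iteration would be a separate job; here they are ordinary Python functions so the data flow is easy to follow.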
Keywords/Search Tags: Cluster, Parallel, Map/Reduce, Web