
Research On Web Structure Mining Algorithm Based On Cloud Computing

Posted on: 2011-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: X Gao
Full Text: PDF
GTID: 2178360305460119
Subject: Computer Science and Technology
Abstract/Summary:
Web structure mining studies the link structure among web pages in order to uncover the organizational structure of the network and the knowledge hidden in its link relations. As the internet grows, the analysis and mining of massive network data run into bottlenecks in both computing power and storage. Cloud computing, a recent research hotspot, developed out of grid computing, parallel computing and distributed computing. It can analyze and process massive data effectively, which both lowers the requirements placed on terminal equipment and improves data processing capacity.

Building on a study of the PageRank algorithm and the MapReduce parallel programming model, this thesis makes the following contributions:

1. It combines the PageRank algorithm with the MapReduce programming model and tests the performance of the MapReduce-based PageRank algorithm on different datasets.
2. To address the problems that MapReduce-based PageRank encounters on large datasets, it proposes two improvements. First, it applies the idea of matrix partitioning to reduce the time spent in the shuffle and sort phase of each PageRank iteration. Second, it increases the span covered by each iteration so that fewer iterations are needed, which reduces the network communication and HDFS I/O costs that scale with the number of iterations.
3. It builds a cloud environment with Hadoop and analyzes how different BlockSize settings affect computing performance.

Finally, it tests and compares the performance of the three algorithms on different web datasets. The results show that the improved algorithms have advantages in space usage and iteration time.
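The abstract does not reproduce the implementation, so the following is only a minimal in-memory sketch of how one PageRank iteration is commonly expressed as a map step and a reduce step. It is not the thesis's Hadoop code. The function names, the damping factor of 0.85 and the toy graph are all assumptions chosen for illustration, and the toy graph has no dangling pages, which a real job would need to handle separately.

```python
from collections import defaultdict

DAMPING = 0.85  # conventional damping factor; an assumption, not taken from the thesis


def map_phase(page, rank, out_links):
    """Emit the page's link list plus an equal rank share for each out-link."""
    yield page, ("links", out_links)
    for target in out_links:
        yield target, ("share", rank / len(out_links))


def reduce_phase(page, values, num_pages):
    """Sum the incoming shares for one page and apply the damping formula."""
    out_links, incoming = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            incoming += payload
    new_rank = (1 - DAMPING) / num_pages + DAMPING * incoming
    return page, new_rank, out_links


def pagerank_iteration(graph, ranks):
    """Run one map / shuffle / reduce round entirely in memory."""
    grouped = defaultdict(list)  # stands in for the shuffle and sort stage
    for page, out_links in graph.items():
        for key, value in map_phase(page, ranks[page], out_links):
            grouped[key].append(value)
    new_ranks = {}
    for page, values in grouped.items():
        _, rank, _ = reduce_phase(page, values, len(graph))
        new_ranks[page] = rank
    return new_ranks


def pagerank(graph, iterations=20):
    """Start from a uniform distribution and repeat the iteration."""
    ranks = {page: 1.0 / len(graph) for page in graph}
    for _ in range(iterations):
        ranks = pagerank_iteration(graph, ranks)
    return ranks


# Hypothetical three-page graph: page -> list of pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this sketch the `grouped` dictionary plays the role of the shuffle and sort stage, and every iteration is a full pass over the graph. On a cluster each such pass would mean another round of network transfer and HDFS reads and writes, which is the per-iteration cost the two improvements described above aim to reduce.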
Keywords/Search Tags: Cloud computing, Web structure mining, MapReduce, PageRank algorithm, Hadoop, distributed computing