
Research On Distributed Parallel Data Mining Algorithm Based On Weblog

Posted on: 2018-01-13
Degree: Master
Type: Thesis
Country: China
Candidate: P S Guo
GTID: 2348330515451659
Subject: Communication and Information System
Abstract/Summary:
In the twenty-first century, the rapid development of the Internet has brought great convenience to people's daily lives, and all walks of life are moving closer to the Internet. Meanwhile, users' behavior and footprints on the network are recorded in web log files. Analyzing these files with data mining can bring out a lot of valuable information. Unlike the ordinary text files used in text mining, web log files grow very large because online operations are recorded at every moment. Serial data mining algorithms are therefore no longer suitable for analyzing them, and parallel data mining algorithms have quickly become popular in the web data mining field. Apache Hadoop, as the most mature parallel framework, has been widely used by developers in web data mining. On the algorithm side, parallel clustering algorithms are applied to web log files, and the results can provide a basis for optimizing a site's content structure and for recommending content to users. This thesis analyzes web log files on the Hadoop platform and researches clustering algorithms. The main work covers the following aspects:
1. Research background knowledge about Hadoop and web data mining.
2. Build the Hadoop distributed platform: not only the basic Hadoop platform, but also the installation of Mahout, the installation of the Hadoop plugin for Eclipse, the configuration and allocation of resources, and the design of a parallel preprocessing model.
3. Research the advantages and disadvantages of the Canopy and K-means clustering algorithms, and combine the basic ideas of the two to propose an improved algorithm.
4. Research the parallelization of the algorithm, and use the MapReduce programming model to design the parallel algorithm.
5. Design comparative tests to demonstrate the superiority of the improved algorithm and the parallel algorithm, and their value in practical application.
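
The abstract does not spell out the improved algorithm, so the following is only a minimal sketch of one common way to combine the two methods: Canopy is run first to obtain a set of rough centers, and those centers then seed K-means, whose iterations are written as a map step (assign each point to its nearest center) and a reduce step (average the points of each cluster). All function names, the thresholds `t1` and `t2`, and the sample points are assumptions made for illustration; the in-process loop merely stands in for a Hadoop MapReduce job and is not the thesis's implementation.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def canopy_centers(points, t1, t2):
    """Return one center per canopy (t1 is the loose threshold, t2 the tight one)."""
    assert t1 > t2 > 0
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Every point within t1 of the seed belongs to this canopy (canopies may overlap).
        members = [p for p in points if dist(p, seed) <= t1]
        centers.append(mean(members))
        # Points within t2 of the seed can no longer start a new canopy.
        remaining = [p for p in remaining if dist(p, seed) > t2]
    return centers


def map_assign(point, centers):
    """Map step: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, point


def reduce_mean(assigned_points):
    """Reduce step: the new center is the mean of the points assigned to it."""
    return mean(assigned_points)


def kmeans(points, centers, max_iter=20, tol=1e-6):
    """Iterate map/reduce rounds until the centers move less than tol."""
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:
            idx, pt = map_assign(p, centers)
            groups[idx].append(pt)
        new_centers = [
            reduce_mean(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))
        ]
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


# Hypothetical 2-D feature vectors, e.g. derived from preprocessed log sessions.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (8.0, 8.0), (8.2, 7.8), (7.8, 8.2)]
seeds = canopy_centers(points, t1=3.0, t2=1.5)
final_centers = kmeans(points, seeds)
print(len(seeds), final_centers)
```

In this arrangement the number of clusters is not chosen by hand: it equals the number of canopies, which depends on `t1` and `t2`. That is the usual motivation for pairing the two algorithms, since plain K-means is sensitive to both the choice of k and the initial centers.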
Keywords/Search Tags: Web Data Mining, Hadoop, K-means, Canopy