Research On Distributed Parallel Data Mining Algorithm Based On Weblog

Posted on:2018-01-13

Degree:Master

Type:Thesis

Country:China

Candidate:P S Guo

Full Text:PDF

GTID:2348330515451659

Subject:Communication and Information System

Abstract/Summary:

In the twenty-first century, the rapid development of the Internet has brought great convenience to people’s daily lives, and all walks of life are moving closer to the Internet. Meanwhile, users’ behavior and footprints on the network are recorded in web log files. Analyzing these files with data mining can yield a great deal of valuable information. Unlike the ordinary text files used in text mining, web log files grow very large, because online operations are logged at every moment. Serial data mining algorithms are therefore no longer suitable for analyzing them, and parallel data mining algorithms have quickly become popular in the web data mining field. Apache Hadoop, as the most mature parallel framework, has been widely used by developers in web data mining. On the algorithm side, parallel clustering is applied to web log files, and its results can provide a basis for optimizing a site's content structure and for recommending content to users. This thesis analyzes web log files on the Hadoop platform and studies clustering algorithms. The main work covers five aspects: 1. Study the background knowledge on Hadoop and web data mining; 2. Build the Hadoop distributed platform, which includes not only the basic Hadoop installation but also installing Mahout, installing the Hadoop plugin for Eclipse, configuring and allocating resources, and designing a parallel preprocessing model; 3. Study the advantages and disadvantages of the Canopy and K-means clustering algorithms, and combine the basic ideas of the two to propose an improved algorithm; 4. Study the parallelization of the algorithm, and use the MapReduce programming model to design the parallel algorithm; 5. Design comparative experiments to demonstrate the superiority of the improved algorithm and of its parallel version, and their value in practical applications.
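The abstract does not spell out the improved algorithm, so the following is only a minimal single-machine sketch of one common way to combine the two ideas: Canopy picks the initial centers (and thus the number of clusters), and K-means refines them. The function names, the threshold `t2`, and the sample points are illustrative assumptions, not the thesis's implementation. The loose Canopy threshold T1 is left out because it only affects canopy membership, not which points become centers. In a MapReduce version, the assignment loop would roughly correspond to the map phase and the averaging to the reduce phase.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2):
    """Pick canopy centers: take the next remaining point as a center,
    then drop every remaining point within the tight threshold t2 of it."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if distance(center, p) > t2]
    return centers


def kmeans(points, initial_centers, max_iter=100):
    """Plain K-means started from the given centers; returns final centers."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        # Assignment step (map-like): attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step (reduce-like): move each center to the mean of its group.
        new_centers = []
        for old, group in zip(centers, groups):
            if group:
                new_centers.append(
                    tuple(sum(col) / len(group) for col in zip(*group)))
            else:
                new_centers.append(old)  # keep a center that attracted no points
        if new_centers == centers:
            break  # assignments are stable, so the centers stopped moving
        centers = new_centers
    return centers


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors standing in for preprocessed log records.
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    seeds = canopy_centers(sample, t2=3.0)
    print(len(seeds), "canopy seeds")
    print(kmeans(sample, seeds))
```

Seeding K-means this way removes the need to guess k in advance and avoids random initialization, at the cost of having to choose the Canopy threshold.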

Keywords/Search Tags:

Web Data Mining, Hadoop, K-means, Canopy

Related items

1	Research On Hot Topics Discovery In Microblog Based On Distributed K-means Algorithms
2	Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform
3	Research And Implementation Of College Students' Identification Of Poor Students Based On Hadoop Platform
4	Research On Clustering Algorithm On Hadoop Platform
5	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
6	Research On The Application Of User Behavior Analysis Based On Hadoop
7	High Dimensional Fuzzy C-Means Clustering Recommendation Algorithm Based On Density Canopy
8	Analysis And Research On User Online Shopping Behavior Based On Hadoop
9	Design And Development Of Orderly Electricity Management System Based On Hadoop
10	Research And Design Of Automatic Clustering Based On Massive Scientific Literature