Research On Improvement Of PageRank And K_means Algorithm In Web Data Mining

Posted on: 2020-01-01

Degree: Master

Type: Thesis

Country: China

Candidate: L Huang

Full Text: PDF

GTID: 2428330578456742

Subject: Computer application technology

Abstract/Summary:

Since the 1990s, the Internet and the World Wide Web have developed rapidly, and their functions and services keep expanding. They have become the main place for users to obtain resources, data, and information in the 21st century, which makes Web data mining an urgent task. At present, the k_means algorithm is the most classical and widely used partition clustering algorithm, and the PageRank algorithm is the most widely used algorithm in Web structure mining. On this basis, the principles of the two algorithms are studied and an improved method is proposed for each.

The traditional k_means algorithm chooses its initial cluster centers at random. As a result, the clustering easily falls into a local optimum, clustering accuracy is low, and the result is strongly affected by outliers. To solve this problem, an improved k_means algorithm based on the density standard deviation is proposed. First, the average and standard deviation of the data set samples are calculated. Next, the density distribution function value of each data point is calculated, followed by the average density and the density standard deviation of the samples. A data point whose density distribution function value is less than the density standard deviation of the samples is classified as an outlier. The array of density distribution function values is then searched for its maximum, and the sample point corresponding to the maximum becomes an initial cluster center. Taking this center as the origin, the density function values of all points inside the circle whose radius is the sample average are set to 0. The search is repeated until K initial cluster centers are found.

The traditional PageRank algorithm does not consider users' preferences and suffers from topic drift. To address these two shortcomings, an improved PageRank algorithm based on user preferences and topic links is proposed. The algorithm calculates, in order: the authority values of each website's exit and entry links; the probability that a user visits a website; the authority value of the website; the topic link vector of each page in the website; the similarity of the topic link vectors; and finally the PR value of each page. The pages in the website are then ranked by PR value until all pages are sorted.

Both the traditional and the improved algorithms are implemented in Python on the PyCharm platform and compared experimentally. The experimental results show that the improved k_means algorithm eliminates the influence of outliers and achieves higher accuracy and better clustering results. The improved PageRank algorithm ranks web pages according to users' preferences, so different users get different rankings; this improves the user experience and reduces the time users spend screening for useful web pages. Because the improved algorithm calculates the PR value from the similarity of topic links, it also mitigates topic drift to a certain extent. The experiments demonstrate the feasibility of the two improved algorithms.
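The center-selection steps for the improved k_means algorithm can be illustrated with a minimal Python sketch. The abstract does not define the density distribution function or say what the "sample average" measures, so the choices below are assumptions, not the thesis's formulas: density is a Gaussian-kernel sum, the "sample average" is the mean pairwise Euclidean distance, and the kernel bandwidth is the standard deviation of those distances. The function name `density_based_centers` and the sample data are hypothetical, and the input is assumed to hold at least two points.

```python
import math
from statistics import mean, pstdev


def density_based_centers(points, k):
    """Pick up to k initial centers from dense regions; return (centers, outliers).

    Assumptions (not from the abstract): Gaussian-kernel density, with the
    "sample average" and "standard deviation" taken over pairwise distances.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    pairwise = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    avg_dist = mean(pairwise)            # assumed exclusion radius
    std_dist = pstdev(pairwise) or 1.0   # assumed kernel bandwidth (guard against 0)

    # Density of each point: sum of Gaussian kernel weights to all other points.
    density = [
        sum(math.exp(-((dist[i][j] / std_dist) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]

    # A point whose density is below the density standard deviation is an outlier.
    density_std = pstdev(density)
    outlier_idx = {i for i in range(n) if density[i] < density_std}
    remaining = [0.0 if i in outlier_idx else density[i] for i in range(n)]

    centers = []
    while len(centers) < k and max(remaining) > 0:
        best = max(range(n), key=remaining.__getitem__)
        centers.append(points[best])
        # Zero the density of every point inside the circle around the new center.
        for j in range(n):
            if dist[best][j] <= avg_dist:
                remaining[j] = 0.0
    return centers, [points[i] for i in sorted(outlier_idx)]


# Hypothetical data: two tight groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (20, 0), (20, 1), (21, 0), (21, 1),
          (10, 25)]
centers, outliers = density_based_centers(sample, k=2)
print(centers, outliers)
```

The returned centers would then seed an ordinary k_means run in place of random initialization. Note that the loop can stop with fewer than K centers when the exclusion radius covers every remaining point, which is one reason the radius definition matters.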
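For reference, the traditional PageRank that the thesis sets out to improve can be sketched as a power iteration. The abstract gives no formulas for the improved algorithm, so the optional `preference` argument below is only a stand-in: it implements the well-known personalized PageRank variant (a non-uniform teleport distribution), not the thesis's user-preference or topic-link computation. The function name, parameters, and the small link graph are hypothetical.

```python
def pagerank(links, damping=0.85, preference=None, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    `preference` is an optional {page: weight} teleport distribution
    (personalized PageRank); a stand-in for user preference, not the
    thesis's formula.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    if preference is None:
        preference = {p: 1.0 for p in pages}
    total = sum(preference.get(p, 0.0) for p in pages)
    teleport = {p: preference.get(p, 0.0) / total for p in pages}

    rank = dict(teleport)
    for _ in range(max_iter):
        # Rank held by pages without out-links is redistributed via teleport.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - damping + damping * dangling) * teleport[p] for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new
    return rank


# Hypothetical four-page site.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
uniform = pagerank(web)
personalized = pagerank(web, preference={"d": 1.0})
print(sorted(uniform, key=uniform.get, reverse=True))
```

With a uniform teleport distribution every user sees the same ranking; supplying a per-user `preference` shifts rank toward the pages that user favors, which is the general effect the abstract attributes to its improved algorithm.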

Keywords/Search Tags:

Web data mining, k_means algorithm, PageRank algorithm, Python, PyCharm

Related items

1	Research And Application Of K_means Algorithm And Swarm Intelligence Algorithm(PSO)Fusion
2	Research And Improved Of PageRank Algorithm In Web Data Mining
3	Research Of PageRank Algorithm In Web Structure Mining
4	Research And Implementation On WEB Data Mining Technology Based On Python
5	Based On PageRank Algorithm Of Web Data Mining
6	Research On An Improved Clustering Algorithm Of K_means
7	Research On Data Mining Based On Campus Card System
8	Research Of The PageRank Algorithm In Web Structure Mining
9	Research On “Expert Robot” Based On Big Data Processing Technology
10	Research On K_means Clustering Algorithm Based On MapReduce