
Research On Improvement Of PageRank And K_means Algorithm In Web Data Mining

Posted on: 2020-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: L Huang
Full Text: PDF
GTID: 2428330578456742
Subject: Computer application technology
Abstract/Summary:
Since the 1990s, the Internet and the World Wide Web have developed rapidly, and their functions and services keep expanding. This has made them the main place where users obtain resources, data, and information in the 21st century, and it has made Web data mining an urgent task. At present, k_means is the most classical and widely used partition clustering algorithm, and PageRank is the most widely used algorithm in Web structure mining. On this basis, the principles of the two algorithms are studied and an improved method is proposed for each.

The traditional k_means algorithm chooses its initial cluster centers at random. This easily traps the clustering result in a local optimum, lowers clustering accuracy, and leaves the result strongly affected by outliers. To solve this problem, an improved k_means algorithm based on density standard deviation is proposed. First, the average and standard deviation of the data set samples are calculated. Then the density distribution function value of each data point is calculated, followed by the average density and the density standard deviation of the samples. If the density distribution function value of a data point is less than the density standard deviation of the samples, the point is classified as an outlier. Next, the maximum of the density distribution function values is found, and the corresponding sample point becomes an initial cluster center. With that center as the origin, the density function values of all points inside the circle whose radius is the sample average are set to 0. This search is repeated until K initial cluster centers are found.

The traditional PageRank algorithm does not consider users' preferences and suffers from topic drift. To address these two shortcomings, an improved PageRank algorithm based on user preferences and topic links is proposed. The algorithm first calculates the authority values of each website's outgoing and incoming links, then the probability that a user visits a website, then the authority value of the website, then the topic link vector of each page in the website, and then the similarity of the topic link vectors. Finally, it calculates the PR value of each page and ranks the pages in the website by PR value until all pages are sorted.

Both the traditional and the improved algorithms are implemented in Python on the PyCharm platform and compared experimentally. The experimental results show that the improved k_means algorithm eliminates the influence of outliers and achieves higher accuracy and better clustering results. The improved PageRank algorithm can sort web pages according to users' preferences, so different users get different rankings; this improves the user experience and reduces the time users spend screening useful web pages themselves. Because the improved algorithm calculates PR values from the similarity of topic links, it also mitigates topic drift to a certain extent. The experiments support the feasibility of both improved algorithms.
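The density-based choice of initial centers described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not define the density distribution function or the "sample average", so the sketch assumes the sample average is the mean pairwise Euclidean distance and that a point's density is the number of other points within that distance. The function name `density_init_centers` is hypothetical.

```python
import numpy as np


def density_init_centers(X, k):
    """Pick up to k initial centers by repeatedly taking the densest remaining point.

    Returns (centers, outlier_mask). All modelling choices marked
    "Assumption" below are illustrative, not taken from the thesis.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # Assumption: the "sample average" is the mean distance over all pairs i < j.
    radius = dist[np.triu_indices(n, k=1)].mean()

    # Assumption: density of a point = number of other points within `radius`.
    density = ((dist <= radius).sum(axis=1) - 1).astype(float)

    # Points whose density is below the density standard deviation are outliers.
    outliers = density < density.std()

    # Outliers may not become centers, so their working density is zeroed.
    remaining = np.where(outliers, 0.0, density)

    chosen = []
    for _ in range(k):
        idx = int(np.argmax(remaining))
        if remaining[idx] <= 0:
            break  # no candidate left; fewer than k centers are returned
        chosen.append(idx)
        # Set the density of every point inside the circle around the new center to 0.
        remaining[dist[idx] <= radius] = 0.0

    return X[np.array(chosen, dtype=int)], outliers
```

The returned centers could then seed an ordinary k_means run, for example through the `init` argument of scikit-learn's `KMeans` with `n_init=1`, while the outlier mask can be used to exclude flagged points before clustering.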
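For reference, the traditional PageRank baseline that the thesis sets out to improve can be sketched with power iteration. The abstract does not give the formulas for the authority values, visit probabilities, or topic link similarity, so none of them are reproduced here. The optional `preference` vector below is only the standard personalized-PageRank teleport distribution, included as a hypothetical stand-in to show how a per-user bias changes the ranking; it is not the thesis's method.

```python
import numpy as np


def pagerank(adj, damping=0.85, preference=None, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dense adjacency matrix.

    adj[i, j] > 0 means page i links to page j.
    `preference` (assumption, not from the thesis) is an optional
    non-negative teleport vector; uniform when omitted.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]

    # Teleport distribution: uniform by default, else the normalized preference.
    if preference is None:
        v = np.full(n, 1.0 / n)
    else:
        v = np.asarray(preference, dtype=float)
        v = v / v.sum()

    # Row-stochastic transition matrix; pages without out-links jump according to v.
    out_deg = adj.sum(axis=1, keepdims=True)
    has_links = out_deg > 0
    P = np.where(has_links, adj / np.where(has_links, out_deg, 1.0), v[None, :])

    pr = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = damping * (pr @ P) + (1.0 - damping) * v
        converged = np.abs(new - pr).sum() < tol
        pr = new
        if converged:
            break
    return pr


# Three pages linking in a cycle: 0 -> 1 -> 2 -> 0.
cycle = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]])
uniform_rank = pagerank(cycle)
biased_rank = pagerank(cycle, preference=[1, 0, 0])
```

On this symmetric cycle the default call ranks all three pages equally, while the biased call raises the score of page 0, which illustrates in the simplest possible form why different users can receive different orderings.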
Keywords/Search Tags: Web data mining, k_means algorithm, PageRank algorithm, Python, PyCharm