Research Of WEB Structure Mining Technologies Based On Link Similarity Analysis

Posted on: 2013-10-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y S Zhang
GTID: 1268330425466997
Subject: Computer application technology
Abstract/Summary:
In recent years, WEB services have developed rapidly and the amount of information on the WEB has grown exponentially, with tens of millions of WEB pages created every day. WEB pages are now involved in education, government, e-commerce, news, advertising, consumer information, financial management and many other services, and together they are becoming a huge, widely distributed, global information service center. The links among WEB pages form a deep-rooted network, much like human society. Link analysis methods are useful for finding regular patterns in WEB resources, which helps people efficiently obtain the information they need and find authoritative information in a given field.

This dissertation focuses on four key technologies in WEB mining: WEB link prediction, WEB SPAM page recognition, WEB structure mining and WEB page clustering.

Firstly, a new similarity definition is introduced, and a multipath walk (MW) link prediction algorithm based on this similarity is proposed. 1) A new attenuation factor is put forward and used to define the new similarity formula; 2) Rubin's algorithm is improved and combined with the new similarity formula to obtain the similarity of nodes; 3) the nodes are ranked by similarity to obtain link prediction results on data sets from actual networks. Finally, experiments validate the algorithm.

Secondly, a new definition of node similarity based on page links is proposed, and assumptions about the link structure of Spam pages are made. A link-based similarity clustering algorithm for Spam page recognition is put forward. The algorithm takes into account the connection between two Web pages, which makes it reasonable. Experiments verify the earlier assumptions and validate the efficiency of the algorithm.

Thirdly, an improved PageRank algorithm based on link text similarity and a time factor is proposed to counter the "topic drift" and "laying particular stress on old pages" phenomena. First, the similarity between the link text and its landing page is calculated with the vector space model. Next, a time feedback factor based on the page generation date is designed to weaken the PageRank of old pages and raise the PageRank of new pages. Finally, the similarity and the time feedback factor are integrated into the PageRank algorithm, the PageRank of WEB pages is calculated according to the improved algorithm, and the performance of the algorithm is analyzed.

Fourthly, a heuristic clustering method based on local information is introduced, and the label propagation method based on local information is summarized. The problems of its iterative process, and of using a random strategy to decide which cluster structure a node belongs to, are analyzed. The label propagation algorithm is then improved based on the similarity of node attributes. Experiments are used to assess the efficiency and usability of the algorithm, and the algorithm is put into preliminary application.

Finally, conclusions are drawn and directions for future research are discussed.
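
The abstract does not give the formulas for the improved PageRank, so the following Python sketch only illustrates the general idea under stated assumptions: each link is weighted by the bag-of-words cosine similarity between its link text and the landing page, multiplied by a time factor that shrinks with page age. The exponential decay form, the parameters `decay` and `eps`, and the function names `cosine` and `weighted_pagerank` are all hypothetical choices made for illustration, not the dissertation's definitions.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity of two texts under a simple bag-of-words vector space model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def weighted_pagerank(links, text, age_days, d=0.85, decay=0.001, iters=50, eps=0.01):
    """Illustrative PageRank variant (assumed form, not the dissertation's formula).

    links    : list of (source, target, link_text) tuples
    text     : dict mapping page -> page content
    age_days : dict mapping page -> age of the page in days
    """
    pages = list(text)
    n = len(pages)

    # Assumed edge weight: (link-text/landing-page similarity + eps) * time factor.
    # eps keeps every link weight positive; exp(-decay * age) penalizes old pages.
    out = {p: {} for p in pages}
    for src, dst, anchor in links:
        w = (cosine(anchor, text[dst]) + eps) * math.exp(-decay * age_days[dst])
        out[src][dst] = out[src].get(dst, 0.0) + w

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for src in pages:
            total = sum(out[src].values())
            if total == 0.0:
                # Dangling page (no out-links): spread its rank uniformly.
                for p in pages:
                    new[p] += d * rank[src] / n
            else:
                # Distribute rank in proportion to the weighted out-links.
                for dst, w in out[src].items():
                    new[dst] += d * rank[src] * w / total
        rank = new
    return rank
```

In this sketch, a link whose text matches its landing page passes on more rank than an off-topic link, and of two equally relevant landing pages the newer one receives more. This is one plausible way to address topic drift and the bias toward old pages; the actual time feedback factor in the dissertation may take a different form.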
Keywords/Search Tags: WEB Mining, Similarity analysis, Link prediction, SPAM page identification, Structure mining, Clustering technique