Research Of WEB Structure Mining Technologies Based On Link Similarity Analysis

Posted on:2013-10-29

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y S Zhang

Full Text:PDF

GTID:1268330425466997

Subject:Computer application technology

Abstract/Summary:

In recent years, Web services have developed rapidly and the volume of information on the Web has grown exponentially, with tens of millions of Web pages created every day. Web pages now support education, government, e-commerce, news, advertising, consumer information, financial management and many other services, and together they are becoming a huge, widely distributed, global information service center. The links among Web pages form a deep-rooted network much like human society. Link analysis methods help uncover regular patterns in Web resources, which helps people obtain the information they need efficiently and find authoritative information in a given field.

This dissertation focuses on four key technologies in Web mining: Web link prediction, Web spam page recognition, Web structure mining and Web page clustering.

Firstly, a new similarity definition is introduced, and a multipath walk (MW) link prediction algorithm based on this similarity is proposed. 1) A new attenuation factor is put forward and used to define the new similarity formula; 2) Rubin's algorithm is improved and combined with the new similarity formula to compute the similarity of nodes; 3) nodes are ranked by similarity to obtain link prediction results on data sets from real networks. Finally, experiments validate the algorithm.

Secondly, a new definition of link node similarity based on page links is proposed, and assumptions about the link structure of spam pages are made. A spam page recognition algorithm based on link-similarity clustering is then put forward. The algorithm takes the connection between two Web pages into account, which makes it reasonable. Experiments verify the earlier assumptions and validate the efficiency of the algorithm.

Thirdly, an improved PageRank algorithm based on link text similarity and a time factor is proposed to counter the "topic drift" and "overemphasis on old pages" phenomena. First, the similarity between the link text and its landing page is calculated with the vector space model. Next, a time feedback factor based on the page creation date is designed to weaken the PageRank of old pages and raise the PageRank of new pages. Finally, the similarity and the time feedback factor are integrated into the PageRank algorithm to calculate the PageRank of Web pages, and the performance of the improved algorithm is analyzed.

Fourthly, a heuristic clustering method based on local information is introduced, the label propagation method based on local information is summarized, and two of its problems are analyzed: the iterative process, and the use of a random strategy to decide which cluster a node belongs to. An improved label propagation algorithm based on the similarity of node attributes is then presented. Experiments demonstrate the efficiency and usability of the algorithm, and the algorithm is put into preliminary application.

Finally, conclusions are drawn and future research directions are outlined.
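
To make the third contribution more concrete, the Python sketch below shows one way link text similarity and a time factor could be folded into PageRank. The abstract does not state the dissertation's formulas, so the specifics here are assumptions made purely for illustration: the term-frequency cosine similarity, the exponential half-life decay in `time_factor`, the multiplicative edge weight, the small `eps` floor, and all function and parameter names are hypothetical and are not the author's method.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts under a plain term-frequency vector space model."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def time_factor(age_days, half_life_days=365.0):
    """Assumed freshness weight: 1.0 for a brand-new page, halved every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def weighted_pagerank(links, page_text, page_age_days,
                      damping=0.85, iterations=50, eps=1e-3):
    """PageRank with out-links weighted by anchor-text relevance and target freshness.

    links: iterable of (source, target, anchor_text) tuples.
    page_text: dict mapping page -> page content.
    page_age_days: dict mapping page -> age of the page in days.
    """
    pages = sorted(page_text)
    n = len(pages)

    # Assumed edge weight: (anchor/landing-page similarity + eps) * freshness.
    # eps keeps every existing link at a strictly positive weight.
    weights = {p: {} for p in pages}
    for src, dst, anchor in links:
        sim = cosine_similarity(anchor, page_text[dst])
        w = (sim + eps) * time_factor(page_age_days[dst])
        weights[src][dst] = weights[src].get(dst, 0.0) + w

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not weights[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src, out in weights.items():
            total = sum(out.values())
            for dst, w in out.items():
                # Each source passes its rank on in proportion to the edge weight.
                new_rank[dst] += damping * rank[src] * w / total
        rank = new_rank
    return rank
```

Under these assumptions, a link whose anchor text matches its landing page passes on more rank than an off-topic link from the same source, and of two equally relevant targets the newer page receives the larger share.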

Keywords/Search Tags:

Related items

1	Link Analysis Based Page Ranking Improvement And Related Link Spam Algorithm
2	The Research On Web Structure Mining And High Dimensional Data Mining
3	Research Of Social Network Data Mining Algorithm Based On Graph Clustering
4	Research On The Structure Mining Algorithms For Online Networks And Their Applications
5	Spam Filtering Techniques, Based On Data Mining
6	Study On The Clustering Of Large Scale WSN Based On Local Similarity Of Network Structure
7	Bayes Data Mining Technique And Its Application In Anti-Spam
8	Research On Web Structure Mining
9	Research On Method Of Video Structure Mining Based On Content
10	A Research Of Friendship Prediction System On Location-based Social Networks