Research On The Approach To Detecting Spam Page Ranking Based On Link Analysis

Posted on: 2012-03-28 | Degree: Master | Type: Thesis
Country: China | Candidate: D Q Feng | Full Text: PDF
GTID: 2178330335450442 | Subject: Computer application technology
Abstract/Summary:
With the rapid development and popularization of the Internet, people's online activities are increasing every day, and reported data show that the number of Internet users is growing rapidly. Using search engines has become the main way to retrieve information from the vast amount available on the Internet. A search engine crawls, stores, and indexes information on the web; after analyzing the keywords submitted by a user, it returns the pages most relevant to those keywords in ranked order. Users, however, browse only the first few pages of results, so for a site owner a position near the top of the results means traffic and revenue. This has given rise to so-called Search Engine Optimization (SEO). Some SEO practitioners do not improve the quality of their pages but instead concentrate on cheating, luring users and deceiving the ranking algorithm by exploiting loopholes and shortcomings in search engine ranking. Web cheating seriously harms the satisfaction of search engine users, reduces confidence in search engine services, and increases the operating costs of search engine providers. It has become a major problem that search engines must face.

Ranking algorithms fall into two major forms: algorithms based on link analysis and algorithms based on content. The best-known link-based algorithms are PageRank and HITS; Google is famous for adopting PageRank. The main advantage of PageRank is its high efficiency. Its main disadvantage is that it assesses the importance of a page solely from link relationships and ignores page content, so the links to different pages all receive the same weight. The main disadvantages of HITS are the large number of iterations it needs and its tendency to cause topic drift.
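The link-only nature of PageRank described above can be illustrated with a minimal power-iteration sketch. This is an illustration under stated assumptions, not the implementation used in the thesis: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the toy graph are all chosen here for the example.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    damping and iterations are illustrative defaults (assumptions),
    not values taken from the thesis.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n
                    for v in nodes}
        for v, targets in out_links.items():
            if targets:
                # Every out-link gets the same share: content is ignored.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): three pages point at "c", and "c" points at "a".
toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_rank = pagerank(toy_graph)
```

Because the score depends only on who links to whom, a page can raise its rank simply by collecting in-links, which is the weakness that link farms exploit.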
Because the link topology of the web can easily be altered by deliberate cheating, with links added, deleted, or otherwise manipulated, PageRank and HITS are clearly inadequate for detecting cheating pages. To address these shortcomings, many researchers have combined page topic information, content information, and time information with statistical methods, machine learning methods, and time-domain methods to detect web cheating, improving page ranking results to some degree. The main problems are that cheating detection algorithms require a large amount of web data, that the initial data set directly affects the anti-cheating result, and that the algorithms have poor independence and poor convergence, so they must be run many times to obtain relatively satisfactory results.

To solve the problems above, this thesis builds on the TrustRank algorithm and proposes CTR (Combined TrustRank), an algorithm based on a pruning strategy and combined seed set selection. The main content is summarized as follows:

1. Because cheating sites in link farms commonly point to one another and are closely interlinked, a pruning algorithm, ANP (Automatic Node Pruned), is proposed. Since the link structure of some web pages differs significantly from that of others, suspected cheating nodes are identified manually as seed nodes. Based on the hypothesis that "nodes pointing to known cheating nodes may also be cheating nodes", the nodes that point to these cheating nodes are added to the seed set; while the cheating seed set is being expanded, nodes that exceed the threshold Tp are deleted. The process ends only when all cheating nodes in the link farm have been traversed and all links of those cheating nodes have been removed.
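The expansion-and-pruning idea can be sketched roughly as follows. The abstract does not say which quantity is compared against Tp, so this sketch assumes it is the fraction of a node's out-links that point into the current cheating set; that choice, the function name `anp_prune`, and the toy graph are hypothetical and are not taken from the thesis.

```python
from collections import deque


def anp_prune(out_links, spam_seeds, tp):
    """Expand a cheating seed set backwards along links, then prune it.

    ASSUMPTION: a node is treated as cheating when the fraction of its
    out-links pointing into the current cheating set exceeds tp. The
    thesis abstract does not define the measure compared against Tp.
    """
    # Predecessor map: who links to each node.
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, set()).add(src)

    spam = set(spam_seeds)
    queue = deque(spam)
    while queue:
        node = queue.popleft()
        for pred in in_links.get(node, ()):
            if pred in spam:
                continue
            targets = out_links[pred]  # non-empty: pred links to node
            ratio = sum(1 for t in targets if t in spam) / len(targets)
            if ratio > tp:
                spam.add(pred)
                queue.append(pred)

    # Remove the cheating nodes and every link that touches them.
    pruned = {src: [t for t in targets if t not in spam]
              for src, targets in out_links.items() if src not in spam}
    return pruned, spam


# Toy graph (hypothetical): "hub", "f1", "f2" form a small link farm.
web = {
    "hub": ["f1", "f2"],
    "f1": ["hub", "f2"],
    "f2": ["hub", "f1"],
    "news": ["blog", "hub", "wiki"],
    "blog": ["news", "wiki"],
    "wiki": [],
}
pruned_web, found = anp_prune(web, {"hub"}, tp=0.4)
```

In this toy graph the two farm pages are pulled into the cheating set, while "news", which links to the hub with only one of its three out-links, stays below the threshold and is kept.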
After pre-pruning, the number of cheating nodes in the Web graph is reduced and the overall computational cost of CTR is lowered.

2. Drawing on the characteristics of human relationships in social networks, a trusted seed set composed of authoritative sites is established. The quality of the seed set directly affects the results of the CTR algorithm, so seed selection should take full account of both the inbound and outbound link information of each node. The out-degree of a candidate node should not be too large: if it exceeds a threshold, the quality of its outbound links is hard to guarantee. Nor should it be too small: the trust of a seed reaches only nodes within a certain distance, so trustworthy nodes far from the seed set may receive trust values so low that they are mistaken for cheating nodes. This analysis leads to a combined selection algorithm for choosing seed nodes.

3. The public data set WEBSPAM-UK2007 is used in the experiments because it contains five main topics, about seven or eight million pages, and over three billion links, and because it has been manually labeled, so the data are highly accurate. A total of 8,248 sites (7,900 normal sites and 348 cheating sites) are used in comparison experiments against traditional cheating detection algorithms. To make the experimental results easier to compare, all sites are first placed into 20 buckets in descending order of score, with each bucket accounting for 5 percent of the total PageRank/TrustRank value; the distribution of cheating sites across the buckets is then inspected manually.
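The bucket-based comparison can be sketched as below, assuming each bucket is filled in descending score order until it holds its equal share of the total score. The function names `bucket_by_score` and `spam_per_bucket`, the tolerance constant, and the toy scores are hypothetical; the thesis may have bucketed the sites differently.

```python
def bucket_by_score(scores, num_buckets=20):
    """Split sites into buckets that each hold about 1/num_buckets of the
    total score, walking through the sites in descending score order."""
    total = sum(scores.values())
    share = total / num_buckets
    ordered = sorted(scores, key=scores.get, reverse=True)

    buckets = [[] for _ in range(num_buckets)]
    cumulative = 0.0
    index = 0
    for site in ordered:
        buckets[index].append(site)
        cumulative += scores[site]
        # Move on once this bucket's share of the total is used up
        # (a small tolerance guards against floating-point round-off).
        while index < num_buckets - 1 and cumulative >= share * (index + 1) - 1e-12:
            index += 1
    return buckets


def spam_per_bucket(buckets, spam_sites):
    """Count the labeled cheating sites that fall into each bucket."""
    return [sum(1 for site in bucket if site in spam_sites)
            for bucket in buckets]


# Toy scores (hypothetical) split into two buckets instead of twenty.
toy_scores = {"a": 4, "b": 3, "c": 2, "d": 1}
toy_buckets = bucket_by_score(toy_scores, num_buckets=2)
toy_counts = spam_per_bucket(toy_buckets, {"d"})
```

With a good ranking, the labeled cheating sites should be concentrated in the last buckets, which hold many low-scoring sites, rather than in the first few.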
The statistics show that the CTR algorithm is significantly better than traditional algorithms in SD value, spam site detection rate, field bias, precision, and recall, which further verifies the validity and practicality of the algorithm.

In recent years, research on page ranking and web spam detection has attracted the attention of many scholars in China and abroad, and many improved algorithms have emerged. This thesis studies the detection of spam pages in ranking based on link analysis and proposes a more effective detection approach. The work has both theoretical significance and practical value.
Keywords/Search Tags:Search Engine, Page Ranking, Spam Detection, Link Analysis