Research On Web Spam Detection And Web Page Sorting

Posted on: 2013-07-27
Degree: Master
Type: Thesis
Country: China
Candidate: B B Yu
Full Text: PDF
GTID: 2248330395455311
Subject: Computer software and theory
Abstract/Summary:
Web spam manipulates search engines into assigning pages an unfair relevance or importance that ignores their real value. It threatens the fairness of search engine ranking and seriously degrades users' search experience. Detecting web spam by classification and sorting (ranking) pages by content relevance have therefore become active research topics. This thesis focuses on web spam classification and web page ranking, and the main work is as follows:

First, the effect of content and link features on web spam detection is analyzed, and a new feature is proposed to overcome the shortcomings of using only content or only links as classification features. The new feature is obtained by computing the similarity between different elements of two pages, based on the relationship between content and links. The content features, the link features, and the new feature are then combined into a single classification feature set. Finally, to deal with the class imbalance of the dataset, a cost-sensitive method combined with the C4.5 decision tree algorithm is used for web spam classification. Experiments on the public dataset WebSpam-UK2007 show that the cost-sensitive classification algorithm performs better than the plain C4.5 decision tree algorithm, which verifies the feasibility and validity of the approach.

Second, an improved PageRank algorithm is proposed to overcome two shortcomings of PageRank: its bias toward old pages and its tendency toward topic drift. In the improved algorithm, a time weight factor counters the bias toward old pages, a similarity weight factor counters topic drift, and a further factor is added to resist spam. Compared with the classic PageRank algorithm, the improved algorithm shows a great improvement in retrieving relevant web pages.
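The abstract does not give the formulas of the improved PageRank, so the following is only a minimal sketch of how a time weight and a similarity weight could enter the rank propagation. It assumes that each link's share of its source page's rank is proportional to `similarity(src, dst) * freshness(dst)`. The function name `weighted_pagerank`, the parameters `freshness` and `similarity`, and this multiplicative form are all illustrative assumptions, not the thesis's actual method. The spam-resistance factor is not modeled here.

```python
from typing import Dict, List, Tuple


def weighted_pagerank(
    links: Dict[str, List[str]],
    freshness: Dict[str, float],
    similarity: Dict[Tuple[str, str], float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """PageRank variant with per-link weights (hypothetical formulation).

    links:      page -> list of pages it links to
    freshness:  page -> time weight in (0, 1]; missing pages default to 1.0
    similarity: (src, dst) -> topical similarity; missing pairs default to 1.0
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    # Assumed edge weight: similarity(src, dst) * freshness(dst),
    # normalized so each page's outgoing weights sum to 1.
    weights: Dict[str, Dict[str, float]] = {}
    for src in pages:
        raw = {
            dst: similarity.get((src, dst), 1.0) * freshness.get(dst, 1.0)
            for dst in links.get(src, [])
        }
        total = sum(raw.values())
        weights[src] = {d: w / total for d, w in raw.items()} if total > 0 else {}

    for _ in range(iterations):
        # Rank held by pages without usable out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not weights[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src in pages:
            for dst, w in weights[src].items():
                new_rank[dst] += damping * rank[src] * w
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
    # Page C is treated as old (low time weight), so it should receive less rank than B.
    print(weighted_pagerank(graph, freshness={"B": 1.0, "C": 0.2}, similarity={}))
```

With all weights left at their default of 1.0, the sketch reduces to the classic PageRank iteration, which makes it easy to compare the two rankings on the same link graph.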
Keywords/Search Tags: Web Spam, Decision tree, Cost-sensitive, PageRank