Web Page Importance Ranking With Priori Knowledge

Posted on: 2008-06-09
Degree: Master
Type: Thesis
Country: China
Candidate: G Y Shen
Full Text: PDF
GTID: 2298330422989324
Subject: Control Science and Engineering
Abstract/Summary:
The Internet has become a major source of information, and search engines are powerful tools for finding that information. However, two problems degrade the results they return. First, there are no rules governing how information is published and updated on the Internet, so much of it is redundant or out of date. Second, driven by economic interest, some organizations and individuals try to manipulate their rank in the results returned by search engines. Together, these problems make it hard for search engines to give users a good experience. The second problem, manipulation of the ranking of search results, is called “Web spam” and is the more serious of the two. The ranking of search results combines a relevance rank and an importance rank, and Web spam aimed at the importance rank is harder to counter than spam aimed at the relevance rank. In order to detect Web spam and make the ranking of search results more accurate, we propose a system for Web page importance ranking with a priori knowledge.

The first part of our system detects Web spam. Current Web spam detection algorithms detect only particular spamming techniques and fail when spammers change technique. To avoid this, we use temporal information to detect Web spam. This is based on the observation that spam pages and ordinary pages show different evolution patterns over time, and that search engines can observe these patterns. We extract several temporal features and use a boosting algorithm to classify Web pages. The experiments show that our system is effective.

The other important part of our system is an improvement to the PageRank algorithm, which we call the algorithm of Web page importance ranking with a priori knowledge. The a priori knowledge is provided by the Web spam detection system and some other simple techniques. Because search engines crawl the Web incompletely and with bias, the Web graph they obtain does not correspond to the real Web graph, so the PageRank algorithm cannot produce the real importance rank of Web pages. We use a priori knowledge to adjust the Web graph obtained by search engines and try to recover the real importance rank of Web pages. Our algorithm treats the a priori knowledge as constraints and minimizes the adjustment made to the Web graph. The theoretical derivation shows that our method is reasonable, and an initial experiment shows that it is effective.
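
To illustrate the general idea of adjusting a crawled Web graph with prior knowledge before computing importance, here is a minimal sketch. It is not the thesis's algorithm: the abstract describes a constrained adjustment that changes the graph as little as possible, while this sketch substitutes a much cruder adjustment that simply removes the outlinks of pages already flagged as spam, then runs standard power-iteration PageRank. The function names `pagerank` and `drop_spam_links`, the damping factor of 0.85, and the toy graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Standard power-iteration PageRank over a dict {page: [outlinked pages]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread uniformly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


def drop_spam_links(links, spam_pages):
    """Hypothetical adjustment (not the thesis's method): remove the
    outlinks of pages that prior knowledge marks as spam."""
    return {p: ([] if p in spam_pages else list(t)) for p, t in links.items()}


# Toy graph, invented for illustration: s1 and s2 form a link farm for "target".
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "s1": ["target", "s2"],
    "s2": ["target", "s1"],
    "target": ["s1", "s2"],
}
before = pagerank(graph)
after = pagerank(drop_spam_links(graph, {"s1", "s2"}))
```

In this toy graph, the score of "target" is lower in `after` than in `before`, because the link farm no longer passes rank to it directly. A constrained, minimal adjustment of the kind the abstract describes would presumably change the graph far less aggressively than deleting every outlink of a flagged page.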
Keywords/Search Tags: search engine, Web spam detection, PageRank algorithm, Web page importance ranking