Mining The Link Spamming And Malicious Web Pages Based On Topology Structure Of Massive Internet Web Pages

Posted on: 2018-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: J N Wei
Full Text: PDF
GTID: 2348330515476450
Subject: Computer system architecture
Abstract/Summary:
The World Wide Web provides a wealth of information that anyone can access. To find the most useful information among a very large number of pages, users rely mainly on search engines. A search engine typically indexes a large number of web pages and presents those most relevant to the user's query, ordered by estimated relevance and popularity. Users typically visit only the top-ranked pages and ignore the rest, so a page's position in the search results is very important for attracting visitors. To return the most relevant and popular pages for a query, a search engine assigns each page a rank based on certain algorithms; the rank generally increases with the number and rank of the other sites that link to the page. However, link spammers have developed several techniques to exploit these algorithms and raise the rank of their own pages. These techniques typically rely on cooperation among link spammers, who exchange links with one another through underground link-exchange arrangements.

This thesis investigates how to identify spam links and malicious web pages. It gathers web pages and the hyperlinks between them, constructs a topology graph from this massive set of nodes and edges, and analyzes the features of spam links in that graph. By following the spam links to the pages they point to, it identifies malicious web pages on the Internet. We analyze and summarize the characteristics of malicious web pages and of the spam-link topology, predict the structure of the spam-link topology, and propose a topology-based model for mining spam links and malicious web pages. Within this model we propose a simple but efficient algorithm for acquiring and expanding seed nodes.

To build the seed set, some pages in a link farm are first taken as seeds. For each new page, if the page has multiple inbound links from and outbound links to pages in the seed set, it is likely part of the same link farm, and the seed set is grown by adding it. After the seed set is obtained, an expansion step finds more bad pages in the data set and establishes the spam-link topology. In the expansion step, if a page points to a set of bad pages, the page itself is likely to be bad. The expansion therefore proceeds from a bad page to the pages that link to it, following inbound links rather than outbound links.

To verify the performance of the proposed model, we use a Python crawler module to collect web pages. The experimental data, 95,000 pages, are divided into three groups according to crawl time. These pages are located in 8452 different domains. The total number of pages tagged as spam is 6208, and 180 seed nodes are obtained. The experimental results show that the accuracy of the proposed model is 83.3%, which largely achieves the goal of detecting spam web pages and link farms. The spam-link topology constructed from the experimental data is basically consistent with the predicted topology, which indicates that the conjecture about the spam-link topology in this thesis is basically correct. Furthermore, by following the direction of these spam links, the malicious web pages they serve can be found and then reported or publicized, which reduces the exposure of these pages in search engines and helps to maintain Internet security.
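The seed-growing and expansion steps described in the abstract can be sketched in Python as follows. This is only an illustrative sketch, not the thesis's implementation: the abstract gives no thresholds or data structures, so the dictionary-of-sets graph representation, the function names (`build_inlinks`, `grow_seed_set`, `expand_bad_set`), and the thresholds `min_links` and `min_bad_out` are all assumptions made for the example.

```python
from collections import defaultdict, deque


def build_inlinks(outlinks):
    """Invert a page -> set(outlinked pages) map into page -> set(inlinking pages)."""
    inlinks = defaultdict(set)
    for src, targets in outlinks.items():
        for dst in targets:
            inlinks[dst].add(src)
    return inlinks


def grow_seed_set(outlinks, initial_seeds, min_links=2):
    """Add pages with >= min_links links to AND >= min_links links from the seed set.

    min_links is an assumed threshold; the abstract only says "multiple".
    """
    inlinks = build_inlinks(outlinks)
    seeds = set(initial_seeds)
    changed = True
    while changed:  # repeat until no page qualifies any more
        changed = False
        for page in set(outlinks) | set(inlinks):
            if page in seeds:
                continue
            to_seeds = len(outlinks.get(page, set()) & seeds)
            from_seeds = len(inlinks.get(page, set()) & seeds)
            if to_seeds >= min_links and from_seeds >= min_links:
                seeds.add(page)
                changed = True
    return seeds


def expand_bad_set(outlinks, seeds, min_bad_out=2):
    """Mark a page as bad when it points to >= min_bad_out bad pages.

    Candidates are reached by following inbound links of pages already
    marked bad. min_bad_out is an assumed threshold.
    """
    inlinks = build_inlinks(outlinks)
    bad = set(seeds)
    queue = deque(bad)
    while queue:
        bad_page = queue.popleft()
        for candidate in inlinks.get(bad_page, ()):
            if candidate in bad:
                continue
            if len(outlinks.get(candidate, set()) & bad) >= min_bad_out:
                bad.add(candidate)
                queue.append(candidate)
    return bad


# Hypothetical toy graph: a, b, c form a small link farm.
example_links = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "b"},  # points to two farm pages
    "e": {"a", "b"},  # points to two farm pages, has no inbound links
    "f": {"a", "g"},  # points to only one farm page
    "g": set(),
}
example_seeds = grow_seed_set(example_links, {"a", "b"})
example_bad = expand_bad_set(example_links, example_seeds)
print(sorted(example_seeds))  # ['a', 'b', 'c']
print(sorted(example_bad))    # ['a', 'b', 'c', 'd', 'e']
```

In this toy graph, page c joins the seed set because it both links to and is linked from two seed pages. Pages d and e are then marked bad because each points to at least two bad pages, while f, which points to only one, is left alone. Higher thresholds would make the expansion more conservative, at the cost of missing pages on the edge of a link farm.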
Keywords/Search Tags: network security, link spamming, malicious web pages, link farm