Research On Webpage Recognition Technology Based On Vision And Semantics

Posted on: 2021-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Zhang
GTID: 2428330614450026
Subject: Cyberspace security
Abstract/Summary:
With the rapid development of Internet technology, more and more users are going online. While the Internet enriches people's lives, it also raises worrying problems. For example, a large number of undesirable web pages are flooding the Internet, posing a serious threat to people's physical and mental health and to the security of their property. Identifying these pages has long been a top priority for cybersecurity practitioners. Existing webpage recognition technology is reasonably effective against some simple illegal websites, but it cannot yet detect, completely, promptly, and effectively, harmful web pages that spread widely and are well concealed. Humans, by contrast, can easily tell which web pages are similar and which are not, because we understand both the visual and the semantic information of a page. This thesis follows the way humans understand web pages: it extracts semantic and visual features of web page content and uses them to identify the similarity of web pages.

For semantic similarity, this thesis proposes a scheme based on word2vec word vectors. Each web page is first preprocessed, and keywords are extracted with TF-IDF to serve as the page's summary information. word2vec then maps this summary information into the word vector space to generate a feature vector for the page. Finally, cosine similarity is used to compare the feature vectors of two web pages. A webpage clustering experiment was designed to demonstrate the effectiveness of this semantic similarity scheme. The Chinese Wikipedia corpus was collected to train the word vectors, and 8 categories of data were extracted from the Sogou news corpus as clustering data. The proposed method successfully clusters the data into 8 categories, with high purity within each category.

For visual similarity, this thesis uses web page segmentation to divide a page into visual blocks and extracts visual features from each block with a perceptual hash. The set of blocks is then reconstructed into a visual tree, in which each node stores the visual features of its block. Finally, the visual trees generated from two web pages are compared using a visual similarity measure based on the Hamming distance and the tree edit distance. Three experiments were designed to verify the effectiveness of this measure: a webpage grouping experiment, a webpage clustering experiment, and a webpage recognition experiment. The grouping experiment collected 8 real web pages and their corresponding phishing websites, for a total of 12 groups of web pages. The clustering experiment collected multi-year snapshots of the homepages of several websites as a clustering data set, for a total of 71 web pages in 12 categories. The recognition experiment collected 1,051 web pages, of which 51 were target pages and 1,000 were other pages. All three experiments achieved good results, demonstrating the effectiveness of the proposed visual similarity recognition scheme.
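
The two basic comparisons named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes that a page's feature vector is the mean of its keywords' word vectors (the abstract does not say how the vectors are combined), that a perceptual hash is stored as a 64-bit integer, and that the hash similarity is normalized as one minus the fraction of differing bits. The function names and the toy embeddings are hypothetical, and the tree edit distance step is omitted.

```python
import math


def mean_vector(keywords, embeddings):
    """Average the vectors of the keywords that have an embedding.

    Assumption: averaging is one common way to pool word vectors;
    the thesis may combine them differently.
    """
    vectors = [embeddings[w] for w in keywords if w in embeddings]
    if not vectors:
        raise ValueError("no keyword has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for a zero vector
    return dot / (norm_a * norm_b)


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def hash_similarity(hash_a, hash_b, bits=64):
    """Assumed normalization: 1.0 for identical hashes, 0.0 if all bits differ."""
    return 1.0 - hamming_distance(hash_a, hash_b) / bits


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, purely illustrative.
    toy = {"bank": [1.0, 0.0], "login": [0.0, 1.0], "news": [1.0, 1.0]}
    page_a = mean_vector(["bank", "login"], toy)
    page_b = mean_vector(["news"], toy)
    print(cosine_similarity(page_a, page_b))
    print(hash_similarity(0b1011, 0b0010))
```

In a full pipeline, the embeddings would come from a trained word2vec model and the hashes from the segmented visual blocks; the hash similarity of matched blocks would then feed into a tree edit distance over the two visual trees.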
Keywords/Search Tags:Web page recognition, Web page semantic similarity, Web page visual similarity, Web page clustering