Research On Webpage Recognition Technology Based On Vision And Semantics

Posted on:2021-03-26

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Zhang

GTID:2428330614450026

Subject:Cyberspace security

Abstract/Summary:

With the rapid development of Internet technology, more and more users are going online. However, while the Internet enriches people's lives, it also brings worrying problems. For example, large numbers of undesirable web pages flood the Internet and pose a serious threat to people's physical and mental health and to the security of their property. Identifying these malicious web pages has always been a top priority for cybersecurity practitioners. Existing webpage recognition technology is reasonably effective at detecting some simple illegal websites, but it cannot yet completely, promptly and effectively detect widely spreading malicious webpages that are well concealed. Humans can easily tell which web pages are similar and which are not, because we understand both the visual and the semantic information of a page. Following the way humans understand web pages, this thesis extracts the semantic and visual features of web page content in order to identify the similarity of web pages.

For semantic similarity recognition, this thesis proposes a scheme based on word2vec word vectors. Specifically, each webpage is preprocessed, keywords are extracted with TF-IDF to serve as the page's summary information, and word2vec then maps this summary information into the word vector space to generate a webpage feature vector. Finally, cosine similarity is used to compare two webpage feature vectors. A webpage clustering experiment was designed to demonstrate the effectiveness of the proposed semantic similarity calculation scheme. The Chinese Wikipedia corpus was collected to train the word vectors, and 8 categories of data were extracted from the Sogou news corpus as clustering data. The semantic similarity calculation method successfully clustered the data into 8 categories, with high purity within each category.

For visual similarity recognition, this thesis uses web page segmentation to divide each page into visual blocks, and extracts visual features from each block using perceptual hashing. The set of blocks is then reconstructed into a visual tree in which each node stores the visual features of its block. Finally, for the visual trees generated from the web pages, this thesis proposes a visual similarity calculation method based on Hamming distance and tree edit distance. Three experiments were designed to verify the effectiveness of this method: a webpage grouping experiment, a webpage clustering experiment, and a webpage recognition experiment. The webpage grouping experiment collected 8 real webpages and their corresponding phishing websites, 12 groups of webpages in total; the webpage clustering experiment collected the homepages of multiple websites as they changed over several years, for a total of 71 webpages in 12 categories; the webpage recognition experiment collected 1,051 webpages, of which 51 were target webpages and 1,000 were other webpages. All three experiments achieved good results, demonstrating the effectiveness of the proposed visual similarity recognition scheme.
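The semantic pipeline described in the abstract (TF-IDF keywords, word vectors, cosine similarity) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy documents, the hand-written 2-dimensional `embeddings` table standing in for trained word2vec vectors, the function names, and the TF-IDF-weighted averaging used to turn keywords into one page vector (the abstract does not say how the keyword vectors are combined).

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k (word, weight) pairs of each tokenized document, ranked by TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {w: (c / len(doc)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result.append(ranked[:top_k])
    return result


def page_vector(keywords, embeddings):
    """Combine keyword vectors into one page vector (weighted average: an assumption)."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total = 0.0
    for word, weight in keywords:
        if word in embeddings and weight > 0:
            for i, x in enumerate(embeddings[word]):
                vec[i] += weight * x
            total += weight
    return [x / total for x in vec] if total else vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy tokenized "pages" (hypothetical data, not from the thesis).
docs = [
    ["football", "match", "goal", "team"],
    ["football", "team", "league", "goal"],
    ["stock", "market", "shares", "price"],
]

# Hand-written stand-in for trained word2vec vectors (hypothetical values).
embeddings = {
    "football": [0.9, 0.1], "match": [0.8, 0.2], "goal": [0.85, 0.15],
    "team": [0.9, 0.2], "league": [0.8, 0.1],
    "stock": [0.1, 0.9], "market": [0.2, 0.9],
    "shares": [0.1, 0.8], "price": [0.15, 0.85],
}

vectors = [page_vector(kw, embeddings) for kw in tfidf_keywords(docs)]
sim_same_topic = cosine(vectors[0], vectors[1])
sim_diff_topic = cosine(vectors[0], vectors[2])
print(f"sports vs sports:  {sim_same_topic:.3f}")
print(f"sports vs finance: {sim_diff_topic:.3f}")
```

In a real setting the `embeddings` table would be replaced by vectors trained on a large corpus, and the pairwise cosine similarities would feed a clustering step such as the one the abstract evaluates.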
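The visual pipeline (a perceptual hash per block, a visual tree, and a distance that combines Hamming distance with tree edit distance) can be sketched in the same spirit. This is not the thesis's algorithm: the average hash is one simple perceptual hash variant, the tree comparison is a simplified top-down (Selkow-style) edit distance rather than a general tree edit distance, and the cost model, the normalization, the `Node` class and the toy 8x8 "blocks" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


def average_hash(pixels):
    """Average hash of a grayscale grid: one bit per pixel, 1 where pixel > mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


@dataclass
class Node:
    """A visual block: its perceptual hash plus its child blocks (hypothetical structure)."""
    phash: int
    children: List["Node"] = field(default_factory=list)


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node.children)


def tree_distance(a, b, bits=64):
    """Top-down edit distance between two visual trees.

    Assumed cost model: relabeling a node costs the normalized Hamming
    distance of the two hashes; inserting or deleting a subtree costs its
    node count. Subtrees are recompared recursively, fine for small trees.
    """
    relabel = hamming(a.phash, b.phash) / bits
    m, n = len(a.children), len(b.children)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                dp[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1], bits),
            )
    return relabel + dp[m][n]


def visual_similarity(a, b, bits=64):
    """Map the distance into (0, 1]; dividing by the summed sizes is an assumption."""
    return 1.0 - tree_distance(a, b, bits) / (size(a) + size(b))


# Toy 8x8 grayscale "blocks" (hypothetical data).
header = [[255 if r < 4 else 0 for _ in range(8)] for r in range(8)]  # bright top half
body = [[255 if c < 4 else 0 for c in range(8)] for _ in range(8)]    # bright left half
tweaked = [row[:] for row in header]
tweaked[7][7] = 255                                                    # one pixel changed

h_header, h_body, h_tweaked = (average_hash(g) for g in (header, body, tweaked))

original = Node(h_body, [Node(h_header), Node(h_body)])
lookalike = Node(h_body, [Node(h_tweaked), Node(h_body)])
unrelated = Node(h_header, [Node(h_body)])

print(f"original vs lookalike: {visual_similarity(original, lookalike):.4f}")
print(f"original vs unrelated: {visual_similarity(original, unrelated):.4f}")
```

Under these assumptions a page that differs from the original in only a few pixels of one block scores close to 1, while a page with a different layout scores noticeably lower, which is the behavior a phishing-detection threshold would rely on.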

Keywords/Search Tags:

Web page recognition, Web page semantic similarity, Web page visual similarity, Web page clustering

Related items

1	Web Page-oriented Handheld Devices Automatically Cutting Technology Research
2	A Study Of Hybrid Cache Management Mechanism Based On Page Classifier And Page Placer
3	Research On Mining Structure Of WEB Page For Information Extraction
4	Study On Web Data Processing Technology
5	Study On Web Page Watermarking
6	Research Of Web Page Purifying Method Based On Document Object Model
7	Web Structure Mining Based On The Maximum Flow And Page Similarity
8	Page Ranking Algorithm Based On Link Similarity Study
9	Research And Implementation Of Page Allocator
10	Library Page Design Study