Research And Implementation Of Web Page Classification Based On CNN And SVM

Posted on:2021-01-17

Degree:Master

Type:Thesis

Country:China

Candidate:H Y Xu

Full Text:PDF

GTID:2518306503499424

Subject:Computer technology

Abstract/Summary:

The amount of information on the Internet grows explosively every day, and webpages remain one of the most traditional and common ways to convey it. Webpage classification assigns a webpage to an information-domain category according to its content or purpose. To support information retrieval, interest recommendation, content monitoring, and similar functions, researchers have long studied how to classify webpages more quickly and accurately. At present, the most commonly used webpage classification methods are based on understanding webpage content. These methods rely on the webpage source code and on corpora for specific languages and industries. As a result, when they are applied to webpages in multiple languages or industries, accuracy is low and the models generalize poorly.

This thesis studies and implements a webpage classification method based on webpage screenshots. Screenshots of webpages in different languages are collected by crawlers, the raw data are preprocessed into samples, and a multi-class classifier for 6 target types is trained with a CNN as the base network. This model demonstrates the feasibility of classifying webpages from screenshots. The problems of this method are then analyzed, and two optimizations are proposed on that basis: (1) The network structure is changed to combine the CNN with an SVM, more groups of positive and negative sample sets are constructed, and a group of binary classifiers is trained. This model uses both image features extracted automatically by the deep neural network and expert-knowledge features based on visual semantics. (2) 4 sub-areas with specific meanings are selected from the original screenshot, a classification model is trained for each region in turn, and the models are merged, which improves the utilization of the screenshot information.

In the experiments, 4 classification models (including 1 intermediate model) were trained. Compared with two models obtained by the existing training method, the models in this thesis achieved higher single-category sensitivity and overall accuracy than the control models, and the best model reached an overall accuracy of 82.44%. The method classifies webpages from screenshots without understanding their content. It addresses the strong dependence of existing approaches on webpage source code and on language-specific corpora, and it can be applied to webpages in multiple languages.
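
The second optimization in the abstract (choosing 4 sub-areas of the screenshot, training one model per region, and merging them) can be illustrated with a minimal sketch. The abstract does not say which sub-areas are used or how the models are merged, so everything here is an assumption for illustration only: the region names (`header`, `sidebar`, `body`, `footer`), the split proportions, the weighted-average fusion, and the function names are all hypothetical, and the per-region scores stand in for the outputs of whatever classifiers are actually trained.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom) in pixels, the same convention image libraries use for crops.
Box = Tuple[int, int, int, int]


def region_boxes(width: int, height: int) -> Dict[str, Box]:
    """Split a screenshot into four non-overlapping sub-areas.

    The layout (top fifth, bottom fifth, left quarter, remainder) is a
    hypothetical choice; the thesis abstract does not specify the regions.
    """
    header_h = height // 5
    footer_top = height - height // 5
    sidebar_w = width // 4
    return {
        "header": (0, 0, width, header_h),
        "sidebar": (0, header_h, sidebar_w, footer_top),
        "body": (sidebar_w, header_h, width, footer_top),
        "footer": (0, footer_top, width, height),
    }


def fuse_scores(
    region_scores: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Merge per-region class scores with a weighted average (late fusion).

    Weighted averaging is an assumed merging rule, not the one from the thesis.
    """
    n_classes = len(next(iter(region_scores.values())))
    total = sum(weights[name] for name in region_scores)
    fused = [0.0] * n_classes
    for name, scores in region_scores.items():
        for i, score in enumerate(scores):
            fused[i] += weights[name] * score / total
    return fused


def predict(fused: List[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=lambda i: fused[i])
```

In a real pipeline each box would be passed to an image library's crop function, each crop would be scored by its own region model, and the resulting score lists would be handed to `fuse_scores`. The weights would need to be tuned on validation data.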

Keywords/Search Tags:

webpage classification, webpage screenshot, visual semantics, convolutional neural network, support vector machine

Related items

1	Design And Implementation Of Content-based Webpage Collection And Classification System
2	Research On Malicious Webpage And PDF Document Detection Based On SVM Model
3	Research On Webpage Classification Based On Sparse Auto-Encoder And Layer-wise Back Propagation
4	The Research And Design Of Network Information Monitoring And Analysis System
5	Malicious Web Page Detection System Based On Classification Algorithm
6	Web Pages Classification Based On Active Learning Support Vector Machine Learning
7	Research On Classification Algorithm For Chinese Webpage
8	On The Design And Implementation Of Automatic Webpage Classification Algorithm
9	Web Page Classification Oriented To Web Personalization System
10	The Research Of Webpage Denoising Method Based On Classification Technology