Research And Implementation Of Web Page Classification Based On CNN And SVM

Posted on: 2021-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Xu
Full Text: PDF
GTID: 2518306503499424
Subject: Computer technology
Abstract/Summary:
Information on the Internet grows explosively every day, and webpages are a traditional and common way of transmitting it. Webpage classification assigns a webpage to an information-domain category according to its content or purpose. To support information retrieval, interest recommendation, content monitoring, and similar functions, researchers have long studied how to classify webpages more quickly and accurately. At present, the most commonly used webpage classification methods are based on understanding webpage content. These methods rely on the webpage code and on corpora for specific languages and industries. Therefore, when they are applied to webpages in multiple languages or industries, their accuracy is low and the classification models generalize poorly.

This thesis studies and implements a method of webpage classification based on webpage screenshots. Screenshots of webpages in different languages are collected by crawlers. The raw data is preprocessed to obtain samples, and a multi-class classifier for 6 target types is trained with a CNN as the base network. This model demonstrates the feasibility of classifying webpages from screenshots. The problems of this method are then analyzed, and two optimizations are proposed on that basis: (1) The network structure is changed to combine CNN and SVM, multiple groups of positive and negative sample sets are reconstructed, and a group of binary classifiers is trained. This model uses both image features extracted automatically by the deep neural network and expert-knowledge features based on visual semantics. (2) 4 sub-areas with specific meanings are selected from the original screenshot, a classification model is trained for each region in turn, and the region models are merged. This improves the utilization of the information in the screenshot.

In the experiments, 4 classification models (including 1 intermediate model) were trained. Compared with the two models obtained by the existing training method, the models in this thesis achieved higher single-category sensitivity and higher overall accuracy than the control models, and the best-performing model reached an overall accuracy of 82.44%. The method classifies webpages from screenshots without understanding their content. It addresses the problem that existing approaches depend heavily on webpage source code and are tied to specific languages and corpora, and it can be applied to classify webpages in multiple languages.
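The abstract compares models by single-category sensitivity and overall accuracy but does not define either metric. The following is a minimal sketch assuming the standard definitions: the sensitivity of a category is the share of samples truly in that category that the model labels correctly, and overall accuracy is the share of all samples labeled correctly. The function name `sensitivity_and_accuracy` and the example labels are hypothetical; the abstract does not name the six target categories or describe the evaluation code.

```python
from collections import Counter


def sensitivity_and_accuracy(y_true, y_pred):
    """Return (per-category sensitivity dict, overall accuracy).

    Assumes the standard definitions; the thesis may compute them differently.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    # Number of samples whose true label is each category (TP + FN).
    support = Counter(y_true)
    # Number of correctly predicted samples per category (TP).
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    sensitivity = {c: hits[c] / support[c] for c in support}
    accuracy = sum(hits.values()) / len(y_true)
    return sensitivity, accuracy


# Hypothetical labels for illustration only.
y_true = ["news", "news", "shop", "shop", "shop", "forum"]
y_pred = ["news", "shop", "shop", "shop", "forum", "forum"]
sens, acc = sensitivity_and_accuracy(y_true, y_pred)
for category, value in sorted(sens.items()):
    print(f"{category}: sensitivity = {value:.2%}")
print(f"overall accuracy = {acc:.2%}")
```

Reporting both metrics matters when categories are imbalanced: a model can reach a high overall accuracy while still missing most samples of a small category, which only the per-category sensitivity reveals.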
Keywords/Search Tags: webpage classification, webpage screenshot, visual semantics, convolutional neural network, support vector machine