| With the rapid development and growing popularity of the Internet, web pages have become an important source of information. Alongside useful information, however, web pages are filled with commercial advertisements (ads). These ads can consume system resources, interfere with the display of web content, lure users to harmful pages, degrade the user experience, and ultimately reduce user stickiness. Existing methods for blocking web ads are based on filtering rules, and their core task is maintaining a list of such rules. The most popular blocking tool today, Adblock Plus, relies on the EasyList filtering rule list and blocks ads through network control and in-page manipulation. Although rule-list-based blocking partly alleviates the problems caused by ads, the list must be maintained continuously in response to user feedback, which incurs high time and labor costs. Moreover, with the emergence of web page randomization technology, filtering-rule matching will fail. In addition, developers may unintentionally use strings that appear in the filtering rule list as the id or class attribute values of ordinary elements, which causes normal content to be blocked. To avoid the time and labor cost of maintaining a filtering rule list, and to reduce the false positives and false negatives of web ad blocking tools, this thesis first empirically investigates 200 web pages in 4 categories, examines the structure of real ad regions in the page source code, and summarizes 4 forms of ad label nodes found in ad regions. It then presents a method that recognizes ad regions by their ad labels, based on code analysis and image processing, and implements a tool, AdClear, to block ads. The main work of the thesis is as follows: (1) Web page code is analyzed by recursively processing the DOM tree generated from the page's HTML. During traversal, each node is processed according to its type; in particular, nodes containing images are sent to a server for recognition. To reduce the load on the server, filtering rules are proposed that select which nodes are sent to the server for judgment. Based on the code structure of real ad regions, a method is presented for identifying the minimum ad region from an ad label. (2) Image ad labels are recognized by exploiting two characteristics of ad label images: the background color changes smoothly, and the boundary between characters and background is clear. Information entropy and Canny edge detection are used to binarize the image, and HOG features and a CNN are used to extract features from the binarized image. SVM and MLP classification models then classify the image text to complete ad label recognition. Three combinations of binarization, feature extraction, and classification techniques are evaluated: Information Entropy + HOG + SVM, Canny Operator + HOG + SVM, and Canny Operator + CNN. (3) The tool AdClear is implemented and compared with Adblock Plus to demonstrate its effectiveness and efficiency. In our experiments, Canny Operator + HOG + SVM performs best of the three image recognition methods and is therefore used in the image recognition module. In experiments on real ad detection, AdClear outperforms Adblock Plus, achieving an accuracy of 99.55% and a recall of 96.52%, compared with 62% accuracy and 92.34% recall for Adblock Plus. |
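
The recursive DOM processing described in item (1) can be illustrated with a minimal Python sketch. It is not AdClear's implementation: the label vocabulary (`AD_LABELS`), the container tags taken to delimit a region, the image size thresholds, and the rule that the "minimum ad region" is the closest container ancestor of a label are all assumptions made for illustration. Text nodes are checked against the label vocabulary, image nodes pass through a pre-filter that decides whether they would be queued for server-side recognition, and all other nodes are traversed recursively.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Optional

# All constants below are illustrative assumptions, not values from the thesis.
AD_LABELS = {"ad", "ads", "advertisement", "sponsored"}
CONTAINER_TAGS = {"div", "section", "aside", "li", "td"}
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}
MIN_IMG_SIDE = 10    # skip tracking pixels
MAX_IMG_SIDE = 120   # assume ad-label images are small


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Node:
    tag: str
    attrs: dict
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    text: str = ""


class TreeBuilder(HTMLParser):
    """Builds a small DOM-like tree from HTML using the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("document", {})
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs), parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            text_node = Node("#text", {}, self.current, text=data.strip())
            self.current.children.append(text_node)


def image_needs_server_check(node):
    """Pre-filter (assumed rule): queue only small, non-trivial images."""
    try:
        w, h = int(node.attrs.get("width")), int(node.attrs.get("height"))
    except (TypeError, ValueError):
        return True  # unknown size: leave the decision to the server
    return MIN_IMG_SIDE <= min(w, h) and max(w, h) <= MAX_IMG_SIDE


def enclosing_region(node):
    """Closest container ancestor of a label (assumed 'minimum ad region')."""
    while node.parent is not None:
        node = node.parent
        if node.tag in CONTAINER_TAGS:
            return node
    return None


def traverse(node, regions, image_queue):
    """Recursively walk the tree, dispatching on node type."""
    if node.tag == "#text":
        if node.text.lower() in AD_LABELS:
            region = enclosing_region(node)
            if region is not None and region not in regions:
                regions.append(region)
    elif node.tag == "img":
        if image_needs_server_check(node):
            image_queue.append(node.attrs.get("src"))
    else:
        for child in node.children:
            traverse(child, regions, image_queue)


def find_ad_regions(html):
    """Return (ids of detected ad regions, image URLs queued for the server)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    regions, image_queue = [], []
    traverse(builder.root, regions, image_queue)
    return [r.attrs.get("id") for r in regions], image_queue


if __name__ == "__main__":
    page = (
        '<div id="main"><p>Article text</p>'
        '<div id="slot1"><span>Advertisement</span>'
        '<a href="#"><img src="banner.png" width="300" height="250"></a></div>'
        '<div id="slot2"><img src="label.png" width="40" height="20"></div>'
        '</div>'
    )
    print(find_ad_regions(page))
```

In this sketch the text label in `slot1` marks that container directly, while the small image in `slot2` is only queued: whether it is an ad label would be decided by the image recognition step described in item (2), which is outside the scope of this example.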