
Research And Application On Web Page Filtering Technology

Posted on: 2015-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: L Li
GTID: 2298330434450107
Subject: Information networks and security
ABSTRACT: The expansion of the Internet makes it increasingly important to extract useful information from the mass of web pages. However, the large amount of noise information (advertisements, copyright notices, navigation bars, etc.) seriously interferes with search engines when they index pages. This urgent need for web page purification has made web information extraction a hot research topic.

Web page filtering technology aims to filter out the large amount of repeated and theme-unrelated noise information and retain the useful information. Scholars have recently put forward many web page filtering methods based on the characteristics of various web pages. This paper analyzes the advantages and disadvantages of the existing web page filtering methods and points out that some of them cannot make full use of layout and visual features.

In view of the "DIV+CSS" design style that has become mainstream in modern commercial web sites, this paper observes that elements lying in the same div block share common semantic features, and it proposes a DIV_FOREST model to represent web pages. In combination with the Vision-based Page Segmentation (VIPS) Algorithm, a DVPS Algorithm that considers both layout features and visual features is proposed to improve web page filtering efficiency.

Based on the web page segmentation work, this paper extracts the spatial location, semantic features and visual presentation of the data blocks for further analysis and quantification, and then proposes a criterion for distinguishing useful information blocks from noise blocks.

The paper then compares the performance of the new DVPS Algorithm with that of the VIPS Algorithm. Test results show that the new algorithm performs well on "DIV+CSS"-based web pages. Finally, on the basis of the previous filtering work, the purified web pages are fed into a web page classifier, and the classifier's performance serves as an assessment of the algorithm. The experimental results support the rationality of the proposed segmentation model and the validity of the web page filtering method.

This work has been supported by the National Natural Science Foundation of China under Grants 61172072 and 61271308, the Beijing Natural Science Foundation under Grant 4112045, the Research Fund for the Doctoral Program of Higher Education of China under Grant W11C100030, the Beijing Science and Technology Program under Grant Z121100000312024, and the Beijing Municipal Commission of Education Discipline Construction and Graduate Construction Project.
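
The idea of representing a page as a forest of div blocks, as described in the abstract, can be illustrated with a small sketch. The code below is not the thesis's DIV_FOREST model or its DVPS Algorithm, whose details are not given here. It only assumes that top-level div elements become tree roots and nested div elements become children. The names DivNode, build_div_forest and is_noise, the link-density score, and the 0.5 threshold are all hypothetical stand-ins for the thesis's spatial, semantic and visual criteria.

```python
from html.parser import HTMLParser


class DivNode:
    """One div block: its attributes, child div blocks, and simple text counts."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)
        self.children = []
        self.text_chars = 0   # characters of text directly inside this div
        self.link_chars = 0   # subset of text_chars that sits inside <a> tags

    def link_density(self):
        """Share of this block's own text that is link text (0.0 if no text)."""
        if self.text_chars == 0:
            return 0.0
        return self.link_chars / self.text_chars


class _DivForestBuilder(HTMLParser):
    """Collects div elements into a forest; other tags are ignored except <a>."""

    def __init__(self):
        super().__init__()
        self.roots = []       # top-level div blocks
        self.stack = []       # currently open div blocks
        self.link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            node = DivNode(attrs)
            siblings = self.stack[-1].children if self.stack else self.roots
            siblings.append(node)
            self.stack.append(node)
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            # Text is credited to the innermost open div only.
            self.stack[-1].text_chars += n
            if self.link_depth > 0:
                self.stack[-1].link_chars += n


def build_div_forest(html):
    """Parse an HTML string and return the list of top-level DivNode roots."""
    builder = _DivForestBuilder()
    builder.feed(html)
    builder.close()
    return builder.roots


def is_noise(node, threshold=0.5):
    """Hypothetical criterion: a block whose text is mostly links is noise."""
    return node.link_density() > threshold
```

A navigation bar made only of links would score a link density of 1.0 under this stand-in rule and be flagged as noise, while a block of plain article text would score 0.0. A real filter along the lines of the abstract would also need the rendered position and visual style of each block, which this sketch does not model, and it would have to skip script and style content.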
Keywords/Search Tags: Web Page Data Filtering, Web Page Segmentation, DIV_FOREST Model, DVPS Algorithm