Web Page Metadata Extraction Method Based On Visual Block Recognition

Posted on: 2018-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: J C Sun
Full Text: PDF
GTID: 2348330542451653
Subject: Software engineering
Abstract/Summary:
With the explosive growth of Internet data and the diversification of content presentation, the demand for intelligent data identification, extraction and analysis has made information collection and processing extremely complex. Furthermore, the dynamic and open nature of the Internet makes network data difficult to organize. Traditional information collection services require manual analysis of a page's DOM tree structure to determine the location of the data to be extracted; they cannot directly extract what a particular user needs, and they cannot meet the growing demand for personalized services. Automatically identifying and accurately locating the data to be collected on a web page has therefore become particularly important. This thesis proposes an information extraction method based on visual block recognition for collecting information from video web pages, in order to solve the problem of automatically identifying and extracting web page meta-data. The main work of the thesis is as follows:

(1) Analysis of web page information extraction technology and visual characteristics. The thesis studies three kinds of web page information extraction technology, based on the DOM tree, on visual features and on text features respectively, and compares their advantages and disadvantages. Taking into account the characteristics of the video page information to be extracted, it summarizes users' visual rules and designs a web page meta-data extraction method based on the visual characteristics of web pages.

(2) Division of pages into visual blocks. Based on the visual DOM tree structure and the DIV + CSS web design style, the thesis defines visual blocks and the combination rules applied to different DOM nodes, and uses an improved VIPS algorithm to divide the web page information into multiple visual blocks with explicit semantics, each corresponding to a different visual area on the page.

(3) Visual block classification and web page meta-data extraction. Based on the support vector machine classification algorithm, the thesis proposes specific feature value extraction rules for the characteristics of video information and divides the visual blocks into effective visual blocks (those containing the information to be extracted) and invalid visual blocks, achieving intelligent identification of data. Finally, path expressions are used to extract the page meta-data from the effective visual blocks.

(4) Experimental evaluation of the method based on visual block recognition. Using mainstream video portals, the thesis tests visual block division, effective block recognition and web page meta-data extraction separately, and measures the performance of the extraction method by the precision and recall of the extracted data.
Keywords/Search Tags: visual block, SVM Light, web page meta-data, path expression
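
As a minimal sketch of the last part of step (3) and the scoring in step (4), the Python example below applies path expressions to a single effective visual block and then computes precision and recall against hand-labelled values. The markup, field names, path expressions and labelled values are all hypothetical and do not come from the thesis. Python's standard xml.etree.ElementTree module is used only because it supports a limited XPath subset; the abstract does not say which path expression engine the thesis uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup of one "effective" visual block on a video page.
BLOCK_XHTML = """
<div class="video-info">
  <h1 class="title">Sample Video</h1>
  <span class="author">Uploader A</span>
  <span class="plays">1024</span>
</div>
"""

# Hypothetical path expressions, one per meta-data field.
FIELD_PATHS = {
    "title": ".//h1[@class='title']",
    "author": ".//span[@class='author']",
    "plays": ".//span[@class='plays']",
    "duration": ".//span[@class='duration']",  # not present in this block
}


def extract_metadata(block_xhtml, field_paths):
    """Return {field: text} for every path that matches a non-empty node."""
    root = ET.fromstring(block_xhtml)
    result = {}
    for field, path in field_paths.items():
        node = root.find(path)
        if node is not None and node.text:
            result[field] = node.text.strip()
    return result


def precision_recall(extracted, expected):
    """Score extracted fields against hand-labelled expected fields.

    precision = correct extracted fields / all extracted fields
    recall    = correct extracted fields / all expected fields
    """
    correct = sum(1 for k, v in extracted.items() if expected.get(k) == v)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical hand-labelled ground truth for the whole page.
    expected = {
        "title": "Sample Video",
        "author": "Uploader A",
        "plays": "1024",
        "duration": "03:15",
    }
    extracted = extract_metadata(BLOCK_XHTML, FIELD_PATHS)
    print(extracted)
    print(precision_recall(extracted, expected))  # (1.0, 0.75)
```

In this toy case every extracted field is correct, so precision is 1.0, while the missing duration field lowers recall to 0.75. Real video portal pages are rarely well-formed XML, so a tolerant HTML parser would be needed in place of ET.fromstring.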