Web Page Metadata Extraction Method Based On Visual Block Recognition

Posted on:2018-02-25

Degree:Master

Type:Thesis

Country:China

Candidate:J C Sun

Full Text:PDF

GTID:2348330542451653

Subject:Software engineering

Abstract/Summary:

With the explosive growth of Internet data and the diversification of content presentation, the demand for intelligent data identification, extraction, and analysis has made information collection and processing extremely complex. The dynamic and open nature of the Internet also makes network data difficult to organize. Traditional information collection services require manual analysis of a page's DOM tree structure to determine the location of the data to be extracted. They cannot directly extract the specific content that users need, and they cannot meet the growing demand for personalized service. How to automatically identify and accurately locate the web data to be collected has therefore become particularly important. This thesis proposes an information extraction method based on visual block recognition for collecting information from video web pages, in order to solve the problem of automatically identifying and extracting web page meta-data. The main work of the thesis is as follows:

(1) Analysis of web page information extraction technology and visual characteristics. The thesis studies three kinds of web page information extraction technology, based on the DOM tree, on visual features, and on text features, and compares their advantages and disadvantages. Combining this with the characteristics of the video page information to be extracted, it summarizes users' visual rules and designs a web page meta-data extraction method based on the visual characteristics of web pages.

(2) Division of pages into visual blocks. Based on the visual DOM tree structure and the DIV + CSS web design style, the thesis defines rules for visual blocks and for combining the rules applied to different DOM nodes, and uses an improved VIPS algorithm to divide the web page information into multiple visual blocks with explicit semantics, each corresponding to a different visual area on the page.

(3) Visual block classification and web page meta-data extraction. Based on the support vector machine classification algorithm, the thesis proposes specific feature-value extraction rules for the characteristics of video information and divides visual blocks into effective visual blocks (those containing the information to be extracted) and invalid visual blocks, achieving intelligent identification of the data. Finally, path expressions are used to extract the page meta-data from the effective visual blocks.

(4) Experimental method based on visual block recognition. Using mainstream video portals, the thesis tests visual block division, effective block recognition, and web page meta-data extraction respectively. The performance of the extraction method is measured by the precision and recall of the extracted data.
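
The last two steps of the abstract (path-expression extraction from an effective visual block, then evaluation by precision and recall) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the block markup, the field names, and the path expressions below are hypothetical, the path expressions use the limited XPath subset supported by Python's standard `xml.etree.ElementTree`, and precision and recall follow their standard definitions (correct extracted fields divided by extracted fields, and by expected fields, respectively).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup of one visual block already classified as "effective".
BLOCK = """
<div class="video-info">
  <h1 class="title">Sample Video</h1>
  <span class="director">A. Director</span>
  <span class="year">2017</span>
</div>
"""

# Hypothetical path expressions, one per metadata field (assumed names).
PATHS = {
    "title": ".//h1[@class='title']",
    "director": ".//span[@class='director']",
    "year": ".//span[@class='year']",
    "actors": ".//span[@class='actors']",
}


def extract_metadata(block_markup, paths):
    """Apply each path expression to the block and collect non-empty text."""
    root = ET.fromstring(block_markup)
    result = {}
    for field, path in paths.items():
        node = root.find(path)  # first matching element, or None
        if node is not None and node.text:
            result[field] = node.text.strip()
    return result


def precision_recall(extracted, expected):
    """Standard precision and recall over field/value pairs."""
    correct = sum(1 for k, v in extracted.items() if expected.get(k) == v)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical hand-labelled ground truth for the same block.
    expected = {
        "title": "Sample Video",
        "director": "A. Director",
        "year": "2018",
        "actors": "B. Actor",
    }
    extracted = extract_metadata(BLOCK, PATHS)
    print(extracted)
    print(precision_recall(extracted, expected))
```

In this toy case the block has no `actors` element, so that field is simply missing from the result and lowers recall, while the mismatched `year` lowers both measures.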

Keywords/Search Tags:

visual block, SVM Light, web page meta-data, path expression

Related items

1	The Research And Implementation Of One Kind Of Web Page Filtering Method Based On Real-Time Network Traffic Data
2	Web Page-oriented Handheld Devices Automatically Cutting Technology Research
3	Research On A Method Of Focused Crawler Based On Page Partition
4	Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match
5	Research On Clustering Of Heterogeneous IoT Data Based On The Meta-Path
6	Research On Deep Web Information Extraction Based On Visual Block And Semantic DOM
7	Study On Creation Technique For Block-based Web Archive
8	Research On Key Techniques Of Path Expression Query Processing For XML
9	Between The Different Types Of Data Clustering Algorithm
10	Key Techniques Study On Facial Expression Recognition