Research On WEB Page Structure And Data Extraction Technology

Posted on:2021-01-25

Degree:Master

Type:Thesis

Country:China

Candidate:S Yu

Full Text:PDF

GTID:2518306047498854

Subject:Computer Science and Technology

Abstract/Summary:

Web pages contain irrelevant content such as navigation bars and advertisements, which harms information retrieval, data mining, and related fields, so page content extraction is an important technology. The content of a modern web page can be divided into noise content, which is unrelated to the theme of the page (for example, the navigation bar), and theme content, which is related to the theme of the page. Theme content includes the main content, the content release time, comments, and so on.

Current page content extraction algorithms are mainly structure-based or content-based. Existing structure-based extraction relies on templates or heuristic rules. Both approaches become outdated quickly and must be updated continually, and both are limited: they achieve high accuracy only when the structures of the web pages are very similar. Content-based extraction algorithms often extract content that is not the main content, which lowers extraction accuracy. To address these problems, this thesis studies the page from the perspectives of both structure and content.

First, to address the poor effectiveness and the limitations of existing structure-based extraction methods, a noise node location algorithm for web pages is proposed. The purpose of this algorithm is to extract the theme content of the page. By studying the structural differences between noise nodes and non-noise nodes, a noise node location model based on location features is proposed. The proposed algorithm is validated by comparing its precision, recall, and comprehensive evaluation F value against a template-based noise node location method and a layout-similarity-based noise node location method.

Second, to address the low accuracy of existing content-based methods for extracting the main content of a page, this thesis proposes an algorithm for extracting the main content of a web page. By studying the content differences between the main content and the non-main content of the page, a main information extraction model based on multi-node and multi-feature analysis is proposed. Comparison with an extraction method based on a text tag paragraph extractor and with a main content extraction algorithm based on visual units shows that the proposed algorithm performs better in accuracy, recall, comprehensive evaluation F value, and running time.
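The abstract reports precision, recall, and a comprehensive evaluation F value but does not state how they are computed. The following is a minimal sketch, assuming the usual set-based definitions and assuming that F is the balanced F1 score (the harmonic mean of precision and recall). The function name, the node identifiers, and the sample data are hypothetical and are not taken from the thesis.

```python
def precision_recall_f(extracted, reference):
    """Compare extracted node ids against hand-labelled reference node ids.

    Assumption: F is the balanced F1 score, 2PR / (P + R).
    """
    extracted, reference = set(extracted), set(reference)
    true_positive = len(extracted & reference)

    # Guard against empty sets so the ratios are always defined.
    precision = true_positive / len(extracted) if extracted else 0.0
    recall = true_positive / len(reference) if reference else 0.0

    if precision + recall == 0:
        f_value = 0.0
    else:
        f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical example: nodes flagged as noise by an algorithm versus
# nodes labelled as noise by a human annotator.
predicted_noise = {"div#nav", "div#ad-top", "div#sidebar", "p#intro"}
labelled_noise = {"div#nav", "div#ad-top", "div#footer"}

p, r, f = precision_recall_f(predicted_noise, labelled_noise)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In this example two of the four predicted nodes are correct and two of the three labelled nodes are found, so precision is 1/2, recall is 2/3, and F is 4/7.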

Keywords/Search Tags:

WEB page, content extraction, theme content, main content, noise node

Related items

1	Research On Content Extraction In HTML Web Pages Based Multi-Features
2	The Research And Implementation On Content Extraction In Web Pages Based Page Segmentation
3	Analysis Of Deep Web Page's Structure And Its Rich-Content Extraction
4	Research On The Technique Of Extracting Web Page Informational Content Based On Node Type Annotation
5	Research On Web-Based Extraction Technology Of Hyperlink And Web Page Content
6	The Designation And Implementation Of Business Insight System Base On Web Content
7	Study On Web Content Extraction And Semantic Recognition
8	The Bank Competitive Intelligence Collection System Based On Internet
9	Research On Content Scheduling Technologies In Content Networks
10	Research On Routing And Caching Algorithm In Content Center Network