Research On WEB Page Structure And Data Extraction Technology

Posted on: 2021-01-25  Degree: Master  Type: Thesis
Country: China  Candidate: S Yu  Full Text: PDF
GTID: 2518306047498854  Subject: Computer Science and Technology
Abstract/Summary:
Web pages contain irrelevant content such as navigation bars and advertisements, which harms information retrieval, data mining, and related fields, so page content extraction technology is very important. The content of today's web pages can be divided into two categories: noise content, which is unrelated to the theme of the page (for example, the navigation bar), and theme content, which is related to the theme of the page and includes the main content, the content release time, comments, and so on. Current page content extraction algorithms are mainly structure-based or content-based. Existing structure-based methods rely on templates or heuristic rules. Both approaches lose their effectiveness over time and need constant updating, and they achieve high accuracy only when the structures of the web pages are very similar. Content-based extraction algorithms often extract content that is not the main content, which results in low extraction accuracy.

To solve these problems, this paper studies web pages from the perspectives of both structure and content. First, to address the poor timeliness and the limitations of existing structure-based extraction methods, a noise node location algorithm for web pages is proposed. The purpose of this algorithm is to extract the theme content of the page. By studying the structural differences between noise nodes and non-noise nodes, a noise node location model based on location features is proposed. The proposed algorithm is verified by comparing its precision, recall, and comprehensive evaluation F value with those of a template-based noise node location method and a layout-similarity-based noise node location method.

Second, to address the low accuracy of existing content-based methods for extracting the main content of a page, this paper proposes an algorithm for extracting the main content of a web page. This algorithm targets the main content of the page. By studying the content differences between the main content and the non-main content of the page, a main information extraction model based on multi-node and multi-feature analysis is proposed. A comparison with an extraction method based on the text tag paragraph extractor and with a main content extraction algorithm based on visual units shows that the proposed algorithm performs better in accuracy, recall, comprehensive evaluation F value, and running time.
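The abstract evaluates both algorithms by precision, recall, and a comprehensive F value. As a minimal sketch of how such scores could be computed for noise node location, the example below compares a set of predicted noise nodes against a hand-labeled set. The function name, the node identifiers, and the choice of the balanced F1 form of the F value are assumptions made for illustration; the thesis abstract does not specify them.

```python
def precision_recall_f(predicted, actual):
    """Return (precision, recall, F1) for two collections of node identifiers.

    Assumption: the "comprehensive evaluation F" is the balanced F1 score,
    i.e. the harmonic mean of precision and recall.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical node identifiers, used only to exercise the function.
predicted_noise = {"nav", "footer", "ad-1", "article"}
labeled_noise = {"nav", "footer", "ad-1", "ad-2"}

p, r, f = precision_recall_f(predicted_noise, labeled_noise)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Here three of the four predicted nodes are truly noise and three of the four labeled noise nodes are found, so precision, recall, and F all equal 0.75. The same function applies to main content extraction if the sets hold extracted and labeled content units instead of noise nodes.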
Keywords/Search Tags:WEB page, content extraction, theme content, main content, noise node