Webpage Content Extraction Via Text Block Density And Tag Path Feature

Posted on:2018-11-08

Degree:Master

Type:Thesis

Country:China

Candidate:P C Liu

Full Text:PDF

GTID:2348330542992602

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the Web has become an important source of information for many applications. In addition to their main content, most webpages also contain noise such as navigation links, advertisements, recommended links, and copyright notices, which degrades the performance of search engines, Web news aggregators, and Web information retrieval applications. Webpage content extraction therefore has both research significance and practical value. This thesis studies webpage content extraction based on a text block density feature and a tag path feature. The main contributions are as follows:

(1) Based on the potential relevance between webpage content layout and text block density and tag information, we design an extraction feature, Text Block Density (TBD), which distinguishes content from noise in a webpage and addresses the problem of accurately extracting short text. After further studying how hyperlink characters are distributed in webpages, we extend the text block density feature to filter noise. We present the Content Extraction via Text Block Density (CETBD) algorithm to extract webpage content. Experimental results on the CleanEval datasets and on Web news pages randomly selected from several well-known websites show that CETBD is a general, efficient, and unsupervised webpage content extraction method.

(2) To improve the accuracy of the CETBD algorithm, we design a new feature that fuses the text block density feature with the tag path coverage feature. Based on this feature, we propose Content Extraction via Text block Density and Tag Path Coverage (CETD-TPC). The experimental results show that CETD-TPC performs better than CETBD, CEPR, and CETD.

(3) We design and implement a Web news content extraction system based on text block density and tag path coverage. The system integrates both the extraction methods proposed in this thesis and several other content extraction algorithms. We describe the framework, implementation, and user interface of the system, and finally give a detailed analysis of the advantages and disadvantages of the content extraction methods in real-world applications.
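
The abstract does not give the formula for Text Block Density, so the following is only an illustrative sketch of the general idea and not the CETBD algorithm itself. It assumes a simple density score: text characters per tag within a block, scaled down by the fraction of characters that sit inside hyperlinks. The names `BlockCollector`, `density`, `extract`, the `BLOCK_TAGS` set, and the threshold are all hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Assumed set of tags that start a new text block (an illustrative choice).
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}


class BlockCollector(HTMLParser):
    """Collects per-block counts of text characters, link characters, and tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks, each a dict of counts
        self._current = None  # block currently being filled
        self._in_link = 0     # nesting depth inside <a> elements

    def _flush(self):
        # Keep only blocks that actually contain text.
        if self._current and self._current["chars"] > 0:
            self.blocks.append(self._current)
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            # The block tag itself counts as the first tag of the block.
            self._current = {"text": "", "chars": 0, "link_chars": 0, "tags": 1}
        elif self._current is not None:
            self._current["tags"] += 1
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link > 0:
            self._in_link -= 1
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        self._current["text"] += text + " "
        self._current["chars"] += len(text)
        if self._in_link:
            self._current["link_chars"] += len(text)

    def close(self):
        super().close()
        self._flush()


def density(block):
    """Assumed score: characters per tag, penalized by the hyperlink character ratio."""
    link_ratio = block["link_chars"] / block["chars"]
    return block["chars"] / block["tags"] * (1.0 - link_ratio)


def extract(html, threshold):
    """Return the text of blocks whose assumed density reaches the threshold."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return [b["text"].strip() for b in collector.blocks if density(b) >= threshold]
```

Under this assumed score, a navigation bar made entirely of links gets a density of zero, while a paragraph of plain text scores roughly its character count, so a single threshold can separate the two. The sketch ignores nested blocks, text outside any block tag, and script or style content, all of which a practical extractor would need to handle.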

Keywords/Search Tags:

Information Extraction, Text Block Density, Tag Path, Feature Fusion

Related items

1	Self-Adaptive Webpage Content Extraction Via Tag Path Features
2	Research On Entity Relationship Extraction Method Based On Natural Language Multi-feature Fusion
3	Research On Extracting Information By Text Density And Structure Of Webpage
4	Crowd Density Estimation Based On Multi-feature Fusion
5	Text Entities And Their Relationship Mining Based On Feature Fusion
6	Research On Feature Extraction And Classification Algorithm In Text Categorization
7	Study On Population Density Estimation Based On Video Image
8	Research On Arbitrary Shape Text Detection Algorithm Based On Multi-path Fusion
9	Feature-based Document Image Retrieval
10	Algorithm Research For Text Information Extraction Based On Hidden Markov Model