| With the rapid development of the Internet, the Web has become an important source of information for many applications. In addition to their main content, most webpages also contain noise such as navigation links, advertisements, recommended links, and copyright notices, which degrades the performance of search engines, Web news aggregators, and Web information retrieval applications. Extracting webpage content therefore has both research significance and practical value. This work studies webpage content extraction based on a text block density feature and a tag path feature. The main contributions are as follows: (1) Based on the potential relationship between the layout of web content and its text block density and tag information, we design an extraction feature, Text Block Density (TBD), which distinguishes content from noise in webpages and addresses the problem of accurately extracting short text. By further studying how hyperlink characters are distributed in webpages, we extend the text block density feature to filter noise. We present the Content Extraction via Text Block Density (CETBD) algorithm to extract webpage content. Experimental results on the CleanEval datasets and on web news pages randomly selected from several well-known websites show that CETBD is a general, efficient, and unsupervised webpage content extraction method. (2) To improve the accuracy of the CETBD algorithm, we design a new feature that fuses the text block density feature with a tag path coverage feature. Based on this new feature, we propose Content Extraction via Text block Density and Tag Path Coverage (CETD-TPC). The experimental results show that CETD-TPC performs better than CETBD, CEPR, and CETD. (3) We design and implement a Web news content extraction system based on text block density and tag path coverage. The extraction methods proposed in this paper and several other content extraction algorithms are integrated into the system. The framework, implementation, and user interface of the system are then described. Finally, we analyze in detail the advantages and disadvantages of the content extraction methods in real-world applications. |
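
The abstract does not give the exact definition of TBD, so the following is only a minimal sketch of the general idea of density-based extraction, under stated assumptions: each block-level element is scored by its text characters per tag, and the score is discounted by the share of characters that sit inside hyperlinks. The names `BLOCK_TAGS`, `BlockDensityParser`, `density_score`, and `extract_content`, the scoring formula, and the threshold are all hypothetical and are not taken from CETBD or CETD-TPC.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit "text blocks"; the original work may segment differently.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}


class BlockDensityParser(HTMLParser):
    """Collects text, link-text length, and tag count for each block element.

    Text is attributed to the innermost open block only. The sketch assumes
    reasonably well-formed HTML (every block tag is explicitly closed).
    """

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks, in closing order
        self.stack = []     # currently open blocks
        self.in_link = 0    # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"text": [], "chars": 0, "link_chars": 0, "tags": 1})
        elif self.stack:
            self.stack[-1]["tags"] += 1
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        if tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            block = self.stack[-1]
            block["text"].append(text)
            block["chars"] += len(text)
            if self.in_link:
                block["link_chars"] += len(text)


def density_score(block):
    """Hypothetical score: characters per tag, scaled by the non-link character ratio."""
    if block["chars"] == 0:
        return 0.0
    non_link_ratio = 1.0 - block["link_chars"] / block["chars"]
    return (block["chars"] / block["tags"]) * non_link_ratio


def extract_content(html, threshold=10.0):
    """Return the text of every block whose score exceeds the (assumed) threshold."""
    parser = BlockDensityParser()
    parser.feed(html)
    parser.close()
    return [" ".join(b["text"]) for b in parser.blocks if density_score(b) > threshold]
```

In this sketch a navigation block made up entirely of links scores zero regardless of its length, while a paragraph of plain text scores roughly its character count. A fixed threshold is the simplest possible decision rule and is used here only for illustration.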