Self-Adaptive Webpage Content Extraction Via Tag Path Features

Posted on: 2017-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: J Hu
Full Text: PDF
GTID: 2348330485962230
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, the web has become an important platform for publishing information. However, most webpages contain not only main content but also unrelated information, such as navigation links, advertisements, and copyright notices, collectively known as noise. Noise in webpages degrades the performance of search engines, news aggregator systems, and similar applications, and it also burdens storage systems. Webpage content extraction is therefore important in both research and real-world applications. The main research work in this dissertation is as follows:

(1) We present Content Extraction via Tag Path Feature Fusion (CEPF), a method that extracts content text from news webpages by using tag path feature fusion. We design a series of tag path features and a feature fusion method that combines them into a new feature called TPF. Compared with each individual tag path feature, TPF distinguishes content from noise better. In the feature fusion step, a feature selection method based on spectral clustering removes redundant features. CEPF then applies Gaussian smoothing, based on tag path edit distance, to update the TPF values, and uses Otsu's method to extract content text from webpages adaptively. CEPF is unsupervised. Experimental results show that CEPF is accurate, general, and language-independent.

(2) We propose Content Extraction via Long Text Ratio (CELTR), a method that extracts the subtrees corresponding to the content from a webpage's DOM tree. In the CELTR algorithm, the Long Text Ratio (LTR) of each subtree of the DOM tree is computed adaptively using Otsu's method. In most cases, LTR is higher for subtrees that belong to the content of a webpage, but there are exceptions. To handle these exceptions, we design two further features, LTRS and RLTRS, which extend LTR. CELTR extracts content from a webpage by clustering subtrees on LTR, LTRS, and RLTRS. CELTR is unsupervised. Experimental results show that CELTR is accurate, general, and language-independent. CELTR also preserves the structure of the content.

(3) We design and develop a domain-oriented web news aggregator system in which CEPF and CELTR are used to solve the webpage content extraction problem. We also analyze in detail the advantages and limitations of content extraction methods in real-world applications.
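
The abstract names two standard building blocks, tag path edit distance and Otsu's method, without giving details. The Python sketch below illustrates both in their generic textbook form. It is not the thesis implementation: the function names, the representation of a tag path as a list of tag names, and the sample scores are all assumptions made for illustration, and the scores merely stand in for a per-node feature such as TPF or LTR.

```python
from typing import Sequence


def tag_path_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two tag paths, one tag per symbol.

    Assumption: a tag path is a list such as ["html", "body", "div", "p"].
    """
    prev = list(range(len(b) + 1))
    for i, tag_a in enumerate(a, start=1):
        curr = [i]
        for j, tag_b in enumerate(b, start=1):
            cost = 0 if tag_a == tag_b else 1
            curr.append(min(prev[j] + 1,          # delete tag_a
                            curr[j - 1] + 1,      # insert tag_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def otsu_threshold(values: Sequence[float]) -> float:
    """Cut point that maximizes between-class variance of 1-D values.

    This is Otsu's criterion applied directly to sorted scores instead of
    an image histogram; the cut is placed midway between two neighbours.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(ordered)
    best_cut, best_var = ordered[0], -1.0
    low_sum = 0.0
    for k in range(1, n):
        low_sum += ordered[k - 1]
        if ordered[k] == ordered[k - 1]:
            continue  # no valid cut between equal values
        w0 = k / n                      # weight of the low class
        w1 = 1.0 - w0                   # weight of the high class
        mu0 = low_sum / k               # mean of the low class
        mu1 = (total - low_sum) / (n - k)  # mean of the high class
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var = var
            best_cut = (ordered[k - 1] + ordered[k]) / 2
    return best_cut


if __name__ == "__main__":
    # Hypothetical per-node scores: low values for noise, high for content.
    scores = [0.1, 0.2, 0.15, 5.0, 6.0, 5.5, 0.3]
    cut = otsu_threshold(scores)
    content = [s for s in scores if s > cut]
    print(cut, content)
    print(tag_path_edit_distance(["html", "body", "div", "p"],
                                 ["html", "body", "div", "div", "p"]))
```

Because the threshold is derived from the score distribution of each page, no per-site tuning is needed, which is the sense in which the abstract calls the extraction adaptive. How the thesis weights the Gaussian smoothing by edit distance is not stated in the abstract, so that step is left out of the sketch.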
Keywords/Search Tags: Information Extraction, Tag Path Feature, Feature Fusion, Feature Selection