Self-Adaptive Webpage Content Extraction Via Tag Path Features

Posted on:2017-12-02

Degree:Master

Type:Thesis

Country:China

Candidate:J Hu

Full Text:PDF

GTID:2348330485962230

Subject:Software engineering

Abstract/Summary:

With the rapid development of the Internet, the web has become an important platform for publishing information. However, most webpages contain not only main content but also unrelated information such as navigation links, advertisements, and copyright notices, collectively known as noise. Noise in webpages degrades the performance of search engines, news aggregator systems, and similar applications, and also burdens storage systems. Webpage content extraction is therefore important in both research and real-world applications. The main contributions of this dissertation are as follows:

(1) We present Content Extraction via Tag Path Feature Fusion (CEPF), a method for extracting content text from news webpages. We design a series of tag path features and a feature fusion method that combines them into a single new feature, called TPF. Compared with any individual tag path feature, TPF distinguishes content from noise better. During feature fusion, a feature selection method based on spectral clustering removes redundant features. CEPF then applies Gaussian smoothing, based on tag path edit distance, to update the TPF values, and uses Otsu's method to extract content text from webpages adaptively. CEPF is unsupervised. Experimental results show that CEPF is accurate, general, and language-independent.

(2) We propose Content Extraction via Long Text Ratio (CELTR), a method for extracting the subtrees of a webpage's DOM tree that correspond to the content. In CELTR, the Long Text Ratio (LTR) of each subtree of the DOM tree is computed adaptively using Otsu's method. In most cases LTR assigns higher values to subtrees that belong to the content of a webpage, but there are exceptions. To address this problem, we design two features, LTRS and RLTRS, which extend LTR. CELTR extracts content from a webpage by clustering subtrees using LTR, LTRS, and RLTRS. CELTR is unsupervised. Experimental results show that CELTR is accurate, general, and language-independent, and that it preserves the structure of the content.

(3) We design and develop a domain-oriented web news aggregator system in which CEPF and CELTR solve the webpage content extraction problem. We also analyze in detail the advantages and limitations of content extraction methods in real-world applications.
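
Two building blocks named in the abstract, tag path edit distance and Otsu's method, can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function names, the unit-cost Levenshtein model over tag sequences, and the sample scores are all assumptions made for illustration. The thresholding function applies Otsu's between-class variance criterion directly to raw values by trying every split point, rather than using the histogram-based form common in image processing.

```python
from typing import Sequence


def tag_path_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two tag paths (hypothetical helper).

    Each tag counts as one symbol; insert, delete, and substitute all cost 1.
    The cost model is an assumption, since the abstract does not specify one.
    """
    prev = list(range(len(b) + 1))
    for i, tag_a in enumerate(a, start=1):
        curr = [i]
        for j, tag_b in enumerate(b, start=1):
            cost = 0 if tag_a == tag_b else 1
            curr.append(min(prev[j] + 1,          # delete tag_a
                            curr[j - 1] + 1,      # insert tag_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def otsu_threshold(values: Sequence[float]) -> float:
    """Return the split that maximizes between-class variance (Otsu's criterion).

    Works on raw values: every gap between distinct sorted values is a
    candidate split, and the threshold is the midpoint of the best gap.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(xs)
    best_var, best_t = -1.0, xs[0]
    left_sum = 0.0
    for k in range(1, n):
        left_sum += xs[k - 1]
        if xs[k] == xs[k - 1]:
            continue  # no gap between equal values, so no split here
        w0 = k / n
        w1 = 1.0 - w0
        mu0 = left_sum / k
        mu1 = (total - left_sum) / (n - k)
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var = var
            best_t = (xs[k - 1] + xs[k]) / 2
    return best_t


# Illustrative per-node scores (made up): low values stand for noise nodes.
scores = [0.05, 0.10, 0.08, 0.90, 0.85, 0.95, 0.12]
threshold = otsu_threshold(scores)
content_scores = [s for s in scores if s > threshold]
```

Because the threshold is derived from the score distribution of each page, no per-site cutoff has to be tuned by hand, which is one plausible reading of "adaptively" in the abstract.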

Keywords/Search Tags:

Information Extraction, Tag Path Feature, Feature Fusion, Feature Selection

Related items

1	Feature Extraction And Feature Fusion For Content-Based Image Retrieval
2	Research On Feature Extraction Technology Of Communication Transmitter Individual
3	Research Of Multi-dimensional Fingerprint Feature Extraction And Fusion For Specific Emitter Identification Technology
4	Based On Kernel Feature Fusion And Selection Of Face Recognition Research
5	Design And Implementation Of Feature Extraction System For Large-Scale Structured Data
6	A Research On Feature Selection And Fusion In Palmprint Recognition
7	Visual Feature Adaptive Selection And Fusion Method For Robust Tracking
8	Study On Model And Algorithm Of Dynamic Feature Fusion Based On Information Sources Selection And Sequential Extraction
9	Face Detection Based On Fusion Of Multiple Features With Cascade Support Vector Machines
10	The Research And Application Of Text-Independent Speaker Recognition Technology