A Dynamic Learning Framework To Automatically Extract Structured Data From Web Pages Without Human Efforts

Posted on:2012-07-21

Degree:Master

Type:Thesis

Country:China

Candidate:Y P Wu

Full Text:PDF

GTID:2218330362453627

Subject:Computer Science and Technology

Abstract/Summary:

The rapidly growing number of web pages contains a wealth of concrete, comprehensive information in the form of structured data, which can be used by search engines and knowledge bases. Although the variety of page styles makes it difficult to extract entity attributes and their corresponding values, sufficient knowledge can be learned automatically from different web pages and different websites. This paper presents a dynamic learning framework that extracts structured information from a large number of websites in various verticals (e.g., books, cameras, jobs) without human effort. Unlike existing approaches, which are static, require manually labeled samples, and cannot adapt to unseen attributes, our approach aims to extract structured data from web pages dynamically, automatically, and completely. To this end, a credible attribute learning system is first built to generate credible attributes from the structural, inner-site, and cross-site features of web pages. Notably, its accuracy improves dynamically as new pages are added to the system. Second, a structured data discovery and extraction procedure is proposed to extract both credible and unseen attributes together with their values. Experiments on a total of 17,850 web pages in 4 verticals demonstrate the effectiveness of the framework.
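The abstract does not give the scoring details, so the following is only a minimal Python sketch of one way the cross-site idea could work, under stated assumptions: a candidate label counts as "credible" when it recurs on a large enough share of the sites seen so far, and the credible set is recomputed as pages are added. The class name `CredibleAttributeLearner`, the `min_site_ratio` threshold, and the label normalization are all hypothetical and are not taken from the thesis; structural and inner-site features are left out.

```python
from collections import defaultdict


class CredibleAttributeLearner:
    """Toy cross-site scorer (hypothetical, not the thesis's actual system)."""

    def __init__(self, min_site_ratio=0.6):
        # Assumed threshold: share of known sites a label must appear on.
        self.min_site_ratio = min_site_ratio
        self.sites_by_label = defaultdict(set)  # normalized label -> sites seen on
        self.all_sites = set()

    @staticmethod
    def _normalize(label):
        # Assumed normalization: trim whitespace, drop a trailing colon, lowercase.
        return label.strip().rstrip(":").strip().lower()

    def add_page(self, site, candidate_labels):
        """Record the candidate attribute labels found on one page of `site`."""
        self.all_sites.add(site)
        for label in candidate_labels:
            self.sites_by_label[self._normalize(label)].add(site)

    def credible_attributes(self):
        """Return labels whose cross-site support meets the threshold."""
        total = len(self.all_sites)
        if total == 0:
            return []
        return sorted(
            label
            for label, sites in self.sites_by_label.items()
            if len(sites) / total >= self.min_site_ratio
        )
```

Calling `add_page` again with pages from a new site changes the support ratios, so a label such as "isbn" can become credible once enough sites confirm it. This mirrors, in a very simplified way, the claim that accuracy improves as new pages are added.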

Keywords/Search Tags:

information extraction, structured data, attribute discovery, learning system

Related items

1	Research On Structured Data Extraction From Web Forums
2	Research On Keyword Extraction And Structured List Data Extraction
3	Research On Extraction And Fusion Of Structured Character Attributes In Web
4	Research On Event Extraction Based On Structured Learning
5	Research And Application Of Extraction Method Of Semi-structured Text Information
6	Ontology-Based Structured Information Extraction From Web Pages
7	Literature Information Extraction System From Academic Homepage
8	The Implementation And Application Of Extracting Structured Data From Web Pages
9	Chinese BBS Information Extraction And Classification
10	Research Of Web Information Extraction Technology Based On Tree Structure