A Dynamic Learning Framework To Automatically Extract Structured Data From Web Pages Without Human Efforts

Posted on: 2012-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Wu
Full Text: PDF
GTID: 2218330362453627
Subject: Computer Science and Technology
Abstract/Summary:
The rapidly growing number of web pages holds a large amount of concrete, comprehensive information in the form of structured data, which can be used by search engines and knowledge bases. Although the wide variety of page styles makes it difficult to extract the attributes of entities and their corresponding values, a computer can learn adequate knowledge from different web pages and different websites. This paper presents a dynamic learning framework that extracts structured information from a large number of websites in various verticals (e.g., books, cameras, jobs) without human effort. Unlike existing approaches, which are static, require manually labeled samples, and cannot adapt to unseen attributes, our approach aims to extract structured data from web pages dynamically, automatically, and completely. To this end, a credible attribute learning system is first built to generate credible attributes from the structural features, inner-site features, and cross-site features of web pages. In particular, its accuracy improves dynamically as new pages are added to the system. Second, a structured data discovery and extraction procedure is proposed to extract both credible and unseen attributes together with their values. Experiments on a total of 17,850 web pages in 4 verticals demonstrate the effectiveness of the framework.
Keywords/Search Tags: information extraction, structured data, attribute discovery, learning system
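
The abstract does not give the details of the credible attribute learning system, so the following is only a minimal sketch of the general idea under stated assumptions: a candidate label counts as credible when it recurs on enough pages within a site (an inner-site signal) and on enough distinct sites (a cross-site signal), and the credible set is recomputed as new pages arrive. The class name `CredibleAttributeLearner`, the helper `normalize`, and the thresholds `min_sites` and `min_page_ratio` are hypothetical and are not taken from the thesis; structural features are not modeled here.

```python
from collections import defaultdict


def normalize(label):
    """Hypothetical normalization: trim whitespace and a trailing colon, lowercase."""
    return label.strip().rstrip(":").strip().lower()


class CredibleAttributeLearner:
    """Toy incremental learner (an assumption, not the thesis's actual system).

    Scores candidate attribute labels by inner-site support (share of a
    site's pages containing the label) and cross-site support (number of
    sites on which the label has enough inner-site support).
    """

    def __init__(self, min_sites=2, min_page_ratio=0.5):
        self.min_sites = min_sites
        self.min_page_ratio = min_page_ratio
        # site -> number of pages seen from that site
        self.pages_per_site = defaultdict(int)
        # label -> site -> number of that site's pages containing the label
        self.label_pages = defaultdict(lambda: defaultdict(int))

    def add_page(self, site, candidate_labels):
        """Register one page; each label is counted at most once per page."""
        self.pages_per_site[site] += 1
        labels = {normalize(raw) for raw in candidate_labels}
        labels.discard("")  # ignore labels that are empty after normalization
        for label in labels:
            self.label_pages[label][site] += 1

    def credible_attributes(self):
        """Return labels with enough inner-site and cross-site support."""
        credible = set()
        for label, per_site in self.label_pages.items():
            supporting_sites = [
                site
                for site, count in per_site.items()
                if count / self.pages_per_site[site] >= self.min_page_ratio
            ]
            if len(supporting_sites) >= self.min_sites:
                credible.add(label)
        return credible


if __name__ == "__main__":
    learner = CredibleAttributeLearner()
    learner.add_page("site-a", ["Title:", "Author:", "Buy now"])
    learner.add_page("site-a", ["Title:", "Author:"])
    learner.add_page("site-b", ["title", "ISBN"])
    print(sorted(learner.credible_attributes()))
    # Adding another page can promote further labels to credible.
    learner.add_page("site-b", ["Author", "Title"])
    print(sorted(learner.credible_attributes()))
```

Because the counts are kept per site, each call to `add_page` can change which labels pass the thresholds. This mirrors, in a very simplified way, the abstract's claim that accuracy improves as new pages are added.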