Research On Web Information Extraction Framework

Posted on:2017-05-17

Degree:Master

Type:Thesis

Country:China

Candidate:X C Teng

Full Text:PDF

GTID:2348330491964091

Subject:Computer Science and Technology

Abstract/Summary:

Techniques for obtaining structured information from semi-structured and unstructured data on the Internet are widely applied in fields such as commercial data mining, social network analysis, and vertical search engines. Structuring information involves a series of procedures: setting the extraction scope, crawling web pages, preprocessing pages, defining the information to be extracted, building extraction rules, and storing the extracted information. These procedures can be divided into application-dependent and application-independent ones. This thesis proposes a general framework for information structurization. Its main idea is that setting the extraction scope and context depends on the specific application, while the other operations are independent of the application. Accordingly, a set of description methods is designed to configure the application-dependent procedures while shielding developers from the application-independent ones, which improves both the generality of the framework and the efficiency of application development. The main work of this thesis is as follows.

(1) A general framework for Web Information Extraction (Web IE) is designed and implemented. The framework abstracts the procedures of information structurization and provides a unified pattern. The overall design follows the engineering principles of abstraction and information hiding: developers configure only the application-dependent procedures (setting the extraction scope and context), and the application-independent procedures are hidden from them.

(2) A word-class generation method based on a knowledge graph is presented and implemented. The thesis introduces the notion of a word class to analyze the topic of a web page and uses word-class vectors for web page classification. Because constructing word classes is a challenge in a Web IE task, a method is proposed to generate them automatically, which reduces the difficulty of this step.

(3) An information extraction method based on DOM node classification is presented and implemented. A supervised learning method is used to build the extraction rules. Information extraction is treated as a classification problem: the DOM nodes of a page are classified, and the information is extracted from the nodes so identified. Three kinds of features are proposed: style features, content features, and context features.

(4) A web page classification experiment is carried out on the dataset of reference [46] and compared with a baseline; the results show that the proposed method outperforms the baseline. An information extraction experiment is carried out on a dataset of book information pages collected from Amazon and other websites, from which the book title, author, and price are extracted; the results show that the proposed method is both effective and scalable.
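The idea in item (3), treating extraction as DOM node classification, can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract names only the three feature groups (style, content, context), so every concrete feature below is an assumption. The 1-nearest-neighbour classifier is a stand-in for whatever supervised learner the thesis uses. The names `NodeCollector`, `extract_nodes`, `classify`, and `extract_record` are hypothetical, as are the sample pages.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
FEATURES = ["is_heading", "has_class", "digit_ratio", "has_currency",
            "length", "depth", "prev_is_label"]


class NodeCollector(HTMLParser):
    """Collects one feature dict per text-bearing DOM node."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements as (tag, attrs)
        self.nodes = []      # feature dicts, in document order
        self.prev_text = ""  # text of the previous text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag, attrs = self.stack[-1]
        digits = sum(ch.isdigit() for ch in text)
        self.nodes.append({
            "text": text,
            # Style features (assumed): heading tag, styled via a class.
            "is_heading": float(tag in {"h1", "h2", "h3"}),
            "has_class": float("class" in attrs),
            # Content features (assumed): digits, currency sign, length.
            "digit_ratio": digits / len(text),
            "has_currency": float(any(c in text for c in "$€£¥")),
            "length": min(len(text), 100) / 100,
            # Context features (assumed): nesting depth, preceding label.
            "depth": min(len(self.stack), 10) / 10,
            "prev_is_label": float(self.prev_text.endswith(":")),
        })
        self.prev_text = text


def extract_nodes(html):
    parser = NodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def vector(node):
    return [node[name] for name in FEATURES]


def classify(node, training):
    """Label a node by its nearest labelled example (squared distance)."""
    v = vector(node)
    def dist(u):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(training, key=lambda pair: dist(pair[0]))[1]


def extract_record(html, training):
    """Keep the text of the first node predicted for each target field."""
    record = {}
    for node in extract_nodes(html):
        label = classify(node, training)
        if label != "other":
            record.setdefault(label, node["text"])
    return record


# One hand-labelled page serves as the (tiny, illustrative) training set.
TRAIN_HTML = ('<div><h1>Dune</h1><p>Author:</p>'
              '<p class="by">Frank Herbert</p>'
              '<span class="price">$9.99</span></div>')
TRAIN_LABELS = ["title", "other", "author", "price"]
TRAINING = list(zip((vector(n) for n in extract_nodes(TRAIN_HTML)),
                    TRAIN_LABELS))

if __name__ == "__main__":
    page = ('<div><h1>Emma</h1><p>Author:</p>'
            '<p class="by">Jane Austen</p>'
            '<span class="cost">$12.50</span></div>')
    print(extract_record(page, TRAINING))
```

The sketch keeps feature extraction separate from the classifier, so a different learner or a richer feature set can be swapped in without touching the DOM traversal.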

Keywords/Search Tags:

information structurization, Web information extraction framework, classification, knowledge graph, extraction rules

Related items

1	The Design And Implementation Of Building Knowledge Graph System Based On Information Extraction
2	Research On Information Extraction And Fusion Of Knowledge Graph For Unstructured Data
3	Design And Implementation Of E-commerce Information Extraction System Based On Knowledge Graph
4	Construction And Analytics Of An Chinese Enterprise Knowledge Graph
5	XML-based WEB Information Extraction System Research And Implementation
6	Design And Implementation Of Web Information Extraction Rules
7	Research On Web Information Extraction For Domain In Information Integration System
8	Neural Network-based Open Information Extraction And Its Application
9	Research On Information Extraction Techniques For Knowledge Graph Construction In Marine Industry
10	Research And Application Of Information Extraction Technology Oriented To Operator Tariff Knowledge Map