Research On Web Information Extraction Framework

Posted on: 2017-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: X C Teng
Full Text: PDF
GTID: 2348330491964091
Subject: Computer Science and Technology
Abstract/Summary:
Techniques for obtaining structured information from semi-structured and unstructured data on the Internet have been widely applied in fields such as commercial data mining, social network analysis and vertical search engines. Information structurization consists of a series of procedures: setting the extraction scope, crawling web pages, preprocessing pages, defining the information to be extracted, building extraction rules and storing the information. These procedures can be divided into application-dependent and application-independent ones. This thesis proposes a general framework for information structurization. Its main idea is that setting the extraction scope and context depends on the specific application, while the other operations are independent of it. Accordingly, a set of description methods is designed for configuring the application-dependent procedures while shielding developers from the application-independent ones, which improves both the generality of the framework and the efficiency of application development. The main work of this thesis is as follows.

(1) A general framework for Web Information Extraction (Web IE) is designed and implemented. The framework abstracts the procedures of information structurization and provides a unified pattern for them. The overall design follows the engineering principles of abstraction and information hiding: developers configure only the application-dependent procedures (setting the extraction scope and context), and the application-independent procedures are hidden from them.

(2) A word-class generation method based on a knowledge graph is presented and implemented. The notion of a word-class is introduced to analyze the topic of a web page, and word-class vectors are used for web page classification. Because constructing the word classes for a Web IE task is challenging, a method is put forward to generate them automatically, which reduces the difficulty of this step.

(3) An information extraction method based on DOM node classification is presented and implemented. A supervised learning method is used to build the extraction rules. Information extraction is treated as a classification problem, and the DOM nodes of a page are classified in order to extract the information they contain. Three kinds of features are proposed: style features, content features and context features.

(4) A web page classification experiment is carried out on the dataset of reference [46] and compared with the baseline. The results show that the proposed method performs better than the baseline. An information extraction experiment is carried out on a dataset of book information pages collected from Amazon and other websites, from which the book title, author and price are extracted. The results show that the proposed method is both effective and scalable.
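
To make the three feature kinds named for the DOM node classification method more concrete, the following is a minimal sketch of how per-node features might be computed. It is not the thesis's implementation: the abstract does not define the individual features, so the class `NodeFeatureExtractor`, the function `extract_node_features` and every feature chosen here (tag name, CSS class, text length, a price pattern, parent tag, depth) are illustrative assumptions. The classifier itself is left out; any supervised learner could consume these dictionaries.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NodeFeatureExtractor(HTMLParser):
    """Hypothetical helper: one feature dict per text-bearing element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, attrs dict)
        self.nodes = []  # collected feature dicts

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag, attrs = self.stack[-1]
        parent = self.stack[-2][0] if len(self.stack) > 1 else None
        self.nodes.append({
            # style features (assumed): how the node is marked up
            "tag": tag,
            "css_class": attrs.get("class") or "",
            # content features (assumed): what the text looks like
            "text": text,
            "n_chars": len(text),
            "has_price": bool(re.search(r"[$€£¥]\s*\d", text)),
            # context features (assumed): where the node sits in the tree
            "parent_tag": parent,
            "depth": len(self.stack),
        })


def extract_node_features(html):
    """Parse an HTML string and return the list of feature dicts."""
    parser = NodeFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


if __name__ == "__main__":
    page = ('<div class="book"><h1 class="title">Dune</h1>'
            '<span class="price">$9.99</span></div>')
    for node in extract_node_features(page):
        print(node)
```

In a setup like this, each dictionary would be paired with a label such as title, author, price or other for training, and the trained classifier would then play the role of the extraction rule.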
Keywords/Search Tags: information structurization, Web information extraction framework, classification, knowledge graph, extraction rules