Domain-Oriented Web Entity Expansion And Robust Optimization Of The Wrapper

Posted on: 2016-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: S G Chen
GTID: 2348330461480028
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of the Internet and the informatization of society, a huge volume of Web data has been produced. Web users' activities depend heavily on rich information, especially domain-oriented data, so helping users find the data they are interested in is an urgent problem. How to obtain information efficiently from the massive amount of data on the Web, and how to build a domain-oriented entity database, are the key questions of this research. This thesis studies how to expand domain-oriented Web entity information in order to support domain-oriented applications.

Existing domain-oriented Web entity extraction systems collect information for a particular industry and take simply structured pages as the extraction target, but they do not perform well on noisy data sources. Most previous approaches treated the task as an information extraction problem on individual documents and made no special use of numerical attributes. In addition, these extraction systems need to become more robust in order to cope with changes in websites.

This thesis studies methods for extracting data entities from the Web based on domain information. Taking the characteristics of Web data into account, it proposes a model for domain-based data entity extraction and designs a system that extracts and expands real estate data entities using a domain-oriented topical crawler model. The main modules of the model are studied and evaluated experimentally, and the experimental results are analyzed.

The main contents of the dissertation are summarized as follows.

1. We propose a model for data entity extraction based on domain information and, around this model, study in depth entity set expansion, entity attribute value extraction, and the optimization of wrapper robustness. The model has good extensibility: various topical crawler models can be used within it.

2. Entity Set Expansion: We first model the data as a bipartite graph, with candidate entities as nodes on one side and their contexts as nodes on the other. We then formally define the set expansion problem using a new similarity metric and a quality metric. Based on these measures, we develop a class of iterative set expansion algorithms.

3. Entity Attribute Values Extraction: We present a method based on integer linear programming to fill in the attribute values of entities. The method rests on a framework that leverages signals not only from the Web page context, but also from a collective analysis of all the pages corresponding to an entity, and from constraints on the actual values within the domain.

4. The Wrapper's Robustness Optimization: Web pages change frequently, and even very slight changes can cause a wrapper to break. We use a robust extraction framework and optimize this model to construct optimal wrappers. Evaluation on real websites shows that, in practice, our algorithms are highly effective at coping with website changes and reduce wrapper breakage.

This thesis presents an exploratory study of entity set expansion, entity attribute value extraction, and wrapper robustness optimization in a Web entity expansion system, and provides an effective method for these problems. To meet the practical demands of the real estate domain, a Web entity expansion system was developed, and it shows that the methods described in this thesis are effective. The research therefore has both theoretical value and a wide range of practical applications.
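
The bipartite-graph view of entity set expansion described in the abstract can be illustrated with a minimal sketch. The abstract does not state the thesis's similarity or quality metrics, so this example substitutes plain Jaccard similarity over context sets as a stand-in. The function names, the threshold, and the sample real estate data are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def build_bipartite(pairs):
    """Map each candidate entity to the set of contexts it occurs in.

    Entities form one side of the bipartite graph, contexts the other.
    """
    contexts_of = defaultdict(set)
    for entity, context in pairs:
        contexts_of[entity].add(context)
    return dict(contexts_of)


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand(seeds, contexts_of, rounds=3, threshold=0.5):
    """Iteratively grow the seed set with candidates similar to it."""
    expanded = set(seeds)
    for _ in range(rounds):
        members = [e for e in expanded if e in contexts_of]
        if not members:
            break
        added = set()
        for cand, ctx in contexts_of.items():
            if cand in expanded:
                continue
            # Average similarity of the candidate to the current set.
            score = sum(jaccard(ctx, contexts_of[m]) for m in members)
            if score / len(members) >= threshold:
                added.add(cand)
        if not added:
            break  # nothing new passed the threshold: stop iterating
        expanded |= added
    return expanded


# Hypothetical (entity, context) observations from crawled pages.
pairs = [
    ("Sunny Garden", "price per sqm"), ("Sunny Garden", "floor plan"),
    ("River View", "price per sqm"), ("River View", "floor plan"),
    ("Lake Park", "price per sqm"), ("Lake Park", "floor plan"),
    ("Weather Today", "forecast"),
]
graph = build_bipartite(pairs)
print(sorted(expand({"Sunny Garden"}, graph)))
# prints ['Lake Park', 'River View', 'Sunny Garden']
```

Starting from one seed, the two entities that share its contexts are added in the first round, while the unrelated page is never admitted. A real system would replace the Jaccard score with the thesis's own metrics and would need a quality measure to stop noisy candidates from drifting into the set.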
Keywords/Search Tags:Domain-oriented web entity, Data extraction and integration, Entity set expansion, Entity attribute values extraction, Robustness of wrappers