
L4D-An Effective Approach To Web Entity Extraction

Posted on:2009-03-27
Degree:Master
Type:Thesis
Country:China
Candidate:H C Zhang
Full Text:PDF
GTID:2178360242483004
Subject:Software engineering
Abstract/Summary:
With the rapid advance of computing and the web, the amount of accessible information in every domain has grown exponentially. Traditional general-purpose search engines face challenges in both performance and end-user experience. In recent years, domain-specific vertical search engines, enabled by web entity extraction techniques, have emerged to offer "one-stop shopping" services, and many vendors and research institutes have invested in the research and development of vertical search tools. In this thesis, we introduce the concepts, architecture, and techniques of web entity extraction, including corpus building, visual parsing, entity content extraction, attribute labeling, post-processing, distributed parallel processing, and pipelining.

Based on the observation that entity extraction performs better on pages that list multiple entities (list pages) than on pages that describe a single entity (detail pages), we propose L4D (List for Detail). By introducing entity-level clues, L4D can improve the performance of a web entity extraction system on detail pages. We present two L4D models, the compensatory model and the directive model, propose an architecture for each, and identify their functional modules. We then discuss the pros and cons of each model and where each applies.

To measure the performance of a web entity extraction system and its components, and to study how that performance changes as the system and its components evolve, we propose an open-domain evaluation solution for web entity extraction, establish a domain-independent evaluation framework based on the entity-relationship model, and implement a portable evaluation system.

We apply L4D to a web entity extraction system and evaluate its performance. We conclude that L4D can effectively improve the performance of entity extraction on detail pages.
Keywords/Search Tags:Vertical Search, Entity Extraction, L4D, Evaluation, Open Domain