Multi-filter String Matching and Human-centric Entity Matching for Information Extraction

Posted on: 2013-06-05
Degree: Ph.D
Type: Thesis
University: The University of Wisconsin - Madison
Candidate: Sun, Chong
Full Text: PDF
GTID: 2458390008487069
Subject: Information Technology
Abstract/Summary:
More and more information is being generated in text documents, such as Web pages, emails and blogs. A broadly used approach to managing this unstructured information is to locate relevant content in the documents, extract structured information from it, and integrate the extracted information for querying, mining or further analysis. In this thesis, we consider two ubiquitous problems that arise in this process: approximate string membership checking and entity matching. The approximate string membership checking problem is to find all the strings in the documents that approximately match some string in a given dictionary. A filter-verification based approach is well recognized as a good way to solve this problem. We propose a new string filter, the token distribution filter, and we use both synthetic and real data sets to verify empirically that it performs well. However, we observe that the token distribution filter is not superior to other filters in all cases, and we suspect that no single filter is optimal across all problem instances. Accordingly, we propose to view approximate string membership checking as an optimization problem, and we propose a multi-filter, optimization based approach that uses all the available string filters to get the best performance. Entity matching is the task of identifying the data records that refer to the same entity. Through entity matching, we can accurately integrate all the information on the same entity, or compare the information about the same entity from different sources. We design a human-centric, two-phase entity matching approach, in which users iteratively check the data records or the intermediate results, propose rules, and apply those rules to achieve high accuracy. We also propose techniques to make users more efficient and effective during the entity matching process.
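As a rough illustration of the filter-verification idea mentioned in the abstract, the Python sketch below pairs a cheap filter with an exact verification step. It is not the thesis's method: the token distribution filter is not specified here, so a simple length filter stands in for it. Edit distance as the similarity measure, whitespace tokens as the candidate strings, and the names `edit_distance`, `length_filter` and `find_approximate_matches` are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def length_filter(candidate: str, entry: str, max_dist: int) -> bool:
    """Cheap necessary condition (assumed stand-in filter): two strings
    within edit distance max_dist differ in length by at most max_dist."""
    return abs(len(candidate) - len(entry)) <= max_dist


def find_approximate_matches(document: str, dictionary: list, max_dist: int = 1):
    """Return (matches, verified): the matching (position, token, entry)
    triples and the number of pairs that reached the verification step."""
    matches = []
    verified = 0
    # Assumption: candidate strings are whitespace-separated tokens.
    for position, token in enumerate(document.split()):
        for entry in dictionary:
            if not length_filter(token, entry, max_dist):
                continue  # pruned by the filter, no verification needed
            verified += 1
            if edit_distance(token, entry) <= max_dist:
                matches.append((position, token, entry))
    return matches, verified


if __name__ == "__main__":
    doc = "the univercity of wisconsin madison"
    found, checked = find_approximate_matches(doc, ["university", "wisconsin"])
    print(found, checked)
```

The filter never rejects a true match, so the result is the same as verifying every pair; it only reduces how many pairs reach the expensive edit-distance computation. A multi-filter variant in the spirit of the abstract would chain several such necessary-condition checks before verification, with the choice and order of filters treated as something to optimize.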
Keywords/Search Tags:Entity matching, Information, String, Filter, Propose