Research On Some Key Issues In Web Information Extraction

Posted on: 2016-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Yu
GTID: 2298330470957724
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Web applications, rich Web information resources have become available on the Internet. To make use of these resources, Web information extraction techniques were proposed and have become a hot research topic in Web-related fields. Web information extraction aims to obtain valuable information effectively and accurately from huge amounts of Web data. It involves many research issues, such as named entity recognition, relation extraction, named entity disambiguation, and sentiment analysis.

In this thesis, we focus on two key issues in Web information extraction: named entity disambiguation and opinionated information extraction. Named entity disambiguation resolves the ambiguity of an entity mention on the Web, so that the exact named entity it refers to can be determined. A single entity mention may correspond to several named entities in the Web environment. For example, the entity mention "Washington" can refer to the American president "George Washington" as well as to the city "Washington". Named entity disambiguation has a wide practical impact on many Web applications, e.g., Web-based question-answering systems, Web search, and machine translation.

Opinionated information extraction aims to mine opinions from a large set of unstructured Web data and to determine the emotion that the data expresses. It has many applications in modern society. For example, it can help enterprises obtain market intelligence, follow market trends, deliver advertisements precisely, and optimize marketing strategies. It also helps governments detect and monitor public opinion on specific events and deal with unexpected emergencies.

Based on an analysis of the challenges in named entity disambiguation and opinionated information extraction, we propose new designs, including new algorithms and experiments, for these two issues. In summary, we make the following contributions:

(1) We present a new Wikipedia-based algorithm for named entity disambiguation. The algorithm includes the following steps: entity mention recognition, candidate entity set construction, entity matching, and other processes. In particular, we propose a new method to measure entity similarity, which leverages different Wikipedia pages to obtain the implicit semantic association between entity mentions and entity candidates, as well as between different entities. The experimental results on the WISE Challenge 2013 dataset suggest the effectiveness of our proposal.

(2) We propose a new algorithm for extracting appraisal expressions from Web data, based on dependency grammar and the SVM classification approach. An appraisal expression is the modification relationship between an opinion word and the target it modifies in a sentence, and it indicates the emotional opinion of the sentence that contains it. We first recognize and extract all appraisal-expression candidates with a pattern-matching approach, and then filter the candidates with the SVM model. In particular, we propose to use dependency grammar to automatically construct training data for the classification process, which can improve the efficiency of the classification model. Experiments on two real datasets of car and camera reviews show that our proposal outperforms several baseline methods in precision and recall.
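The candidate-set construction and entity-matching steps of contribution (1) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure used in the thesis, so the code below substitutes plain bag-of-words cosine similarity between the mention's context and a short description of each candidate. The names `CANDIDATE_PAGES`, `MENTION_TO_CANDIDATES`, and `disambiguate` are hypothetical, and the description strings are hand-written stand-ins for Wikipedia page text, not data from the thesis.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for Wikipedia page text (assumption, not thesis data).
CANDIDATE_PAGES = {
    "George Washington": "first president of the United States and general of the Continental Army",
    "Washington, D.C.": "capital city of the United States on the Potomac River",
}

# Hypothetical mention dictionary: surface form -> candidate entity set.
MENTION_TO_CANDIDATES = {
    "washington": ["George Washington", "Washington, D.C."],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(mention, context):
    """Return the candidate most similar to the context, or None if unknown.

    This is a simplified stand-in for the thesis's entity-matching step.
    """
    candidates = MENTION_TO_CANDIDATES.get(mention.lower(), [])
    if not candidates:
        return None
    context_vec = Counter(tokenize(context))
    return max(
        candidates,
        key=lambda c: cosine(context_vec, Counter(tokenize(CANDIDATE_PAGES[c]))),
    )
```

For instance, `disambiguate("Washington", "Washington served as the first president and led the army")` selects the president, while a context about visiting the capital city on the river selects the city. A real system would also need stop-word handling and the semantic associations between entities that the thesis describes, which this sketch leaves out.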
Keywords/Search Tags: Information extraction, Named entity disambiguation, Opinion analysis, Appraisal expression