Research On Web Entity Activity And Entity Relationship Extraction

Posted on: 2013-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Zhang
GTID: 2248330374982614
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the Web has become a huge data source containing massive amounts of data. How to integrate the valuable information on the Web efficiently, comprehensively, and accurately has gradually become both a research hotspot and a difficult problem in areas such as data integration, information retrieval, and natural language understanding. Such integration would supply data to market intelligence analysis, search engines, intelligent question answering, and other systems: it would enrich their knowledge bases, help reasoning produce better results, and let search engines return accurate data to users. The first problem in integrating Web data is information extraction, which studies how to extract structured data from unstructured data.

Since Web data are massive, heterogeneous, autonomous, and distributed, existing information extraction technologies cannot meet the efficiency, comprehensiveness, and accuracy requirements of Web data integration at the same time. On the one hand, facing massive and distributed Web data, current information extraction aims to harvest named entities, entity relationships, and entity attributes. However, the extraction methods are limited by the domain of the extraction objects, so the results are simple and the content is not rich enough. On the other hand, facing heterogeneous and autonomous unstructured Web data, extraction methods focus on the accuracy of the results, but their efficiency cannot satisfy the needs of large-scale information extraction.

This paper is dedicated to the study of information extraction. To address the above problems, our goal is to harvest more valuable information from massive and heterogeneous Web data sources, making the results richer while preserving accuracy. A large amount of data on the Web describes entity activities.
However, few works focus on defining and extracting this data type in detail. Traditional relation extraction systems seek to distill semantic relational facts from natural language text by assuming that facts are time-invariant. The lack of temporal knowledge reduces the usability of the relational facts.

Our work addresses two problems: existing information extraction technologies can be used only in a limited scope, and the usability of the extracted content is poor. The contributions are as follows:

1. We propose a method based on SVM and extended conditional random fields to extract Web entity activities, a new and so far unexploited data type on the Web. The method extracts Web entity activities accurately and efficiently across multiple domains.

A Web entity activity describes a behavior or activity of an entity at a certain time and place, and the set of activities of a given entity forms a trail of that entity. Few works have addressed this data type. Building on traditional information extraction, this paper defines a formal model of entity activity based on case grammar and presents a method based on a support vector machine and extended conditional random fields to extract Web entity activities accurately. First, in order to train our machine learning models automatically, we put forward a heuristic method that transforms semantic role labeling training data into training data for entity activity extraction. Second, we train a support vector machine classifier and extended conditional random fields on this training data. Third, we use the classifier to identify the sentences that contain Web entity activities. We put forward extended conditional random fields to model the frequency feature and the relationship feature, which traditional conditional random fields cannot model, so that the new model labels the entity activity information in natural language sentences more accurately.
Finally, the experiments show that our method is effective across multiple domains and applies well to Web entity activity extraction.

2. We put forward a bootstrapping method to harvest temporal knowledge for Web entity relationships. This method extracts temporal knowledge of Web entity relationships accurately and efficiently, which makes the extracted content richer and more usable.

Traditional relation extraction systems seek to distill semantic relational facts from natural language text by assuming that facts are time-invariant. However, facts evolve over time; relations have associated validity intervals, and time-dependent relations seem to be far more common than time-invariant ones. Therefore, relations should include time as a first-class dimension. In this paper, we present an approach for automatically harvesting temporal knowledge of entity relationships. Our extraction framework is based on bootstrapping and treats the relation instance as a separate knowledge dimension. The discriminative MNLs soften the hard rules that are usually applied in bootstrapping relation extraction systems by learning their weights in the maximum likelihood sense. To avoid manually labeled training data, we first generate the training data with a heuristic method and select patterns by L1-norm regularized maximum likelihood estimation. We also use a full parser to preprocess the natural language text so that dependency features can be used. The experiments show that our framework is domain-independent and can harvest temporal knowledge of relationships automatically and effectively.
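
As an illustration only, the two data types described in this abstract can be sketched in Python. The class names, field names, and sample values below are assumptions made for the example and are not taken from the thesis. The sketch models the extracted records (an entity activity, an entity trail, and a relational fact with a validity interval); it does not implement the SVM, conditional random field, or bootstrapping extraction methods themselves.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class EntityActivity:
    """One activity: what an entity did, when, and where (hypothetical fields)."""
    entity: str
    action: str
    when: date
    place: str


def trail(activities: List[EntityActivity], entity: str) -> List[EntityActivity]:
    """Return the activities of one entity in chronological order."""
    own = [a for a in activities if a.entity == entity]
    return sorted(own, key=lambda a: a.when)


@dataclass(frozen=True)
class TemporalFact:
    """A relational fact with a validity interval; end=None means still valid."""
    subject: str
    relation: str
    obj: str
    start: date
    end: Optional[date] = None

    def holds_at(self, day: date) -> bool:
        """True if the fact is valid on the given day (interval is inclusive)."""
        if day < self.start:
            return False
        return self.end is None or day <= self.end


# Made-up sample records, for illustration only.
activities = [
    EntityActivity("Alice", "gave a talk", date(2012, 6, 3), "Beijing"),
    EntityActivity("Bob", "signed a contract", date(2012, 1, 9), "Jinan"),
    EntityActivity("Alice", "visited a lab", date(2011, 11, 20), "Shanghai"),
]
alice_trail = trail(activities, "Alice")

fact = TemporalFact("Alice", "works_for", "ExampleCorp",
                    date(2010, 1, 1), date(2011, 12, 31))

print([a.action for a in alice_trail])
print(fact.holds_at(date(2011, 6, 1)), fact.holds_at(date(2012, 6, 1)))
```

Treating the validity interval as part of the fact, rather than as an optional annotation, is one simple way to make time a first-class dimension of a relation: any query against the fact has to say which day it is asking about.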
Keywords/Search Tags: Web entity activities, Information extraction, Machine learning, Web entity relationship, Temporal knowledge of relationships