Research On Valuable Event Recognition In Web Data Integration

Posted on: 2015-02-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Z Xu
GTID: 1268330431455401
Subject: Computer software and theory
Abstract/Summary:
With the development of Web technology, the Web has become a huge information source. It is characterized by large-scale data, autonomy, heterogeneity and ease of communication, and it has become an important platform for obtaining information. Obtaining the required information from the Web accurately and efficiently is important for market intelligence and business intelligence. Compared with the traditional structured data handled in data integration, Web pages contain a huge amount of unstructured data. An event is a kind of unstructured data that involves a specific time, a place and the participants in an activity. Identifying valuable events in Web pages can provide important data for market intelligence analysis.

Web pages contain massive numbers of events that provide timely and extensive information to users. However, an event can be reported from different aspects, so it has many event description sentences. The different descriptions of the same event are called event mentions. It is difficult to identify whether several event mentions refer to the same event. Finding events in Web pages by comparing many coreferent event mentions, and using the information confirmed across these mentions, yields a comprehensive and accurate understanding of the identified events. In addition, integrating attention data from Web reviews with topic data makes it possible to identify valuable events at different levels. The identified valuable events have rich and accurate data and are associated with value information. They can support market intelligence analysis, and they are the basis for further data analysis and mining.

Research on Web valuable event recognition has become a hot research problem. Since Web events are unstructured, randomly described and rich in associations, valuable event recognition must not only identify events but also link them to their value information. The following problems still need to be resolved.

(1) There are different news reports for the same event.
Since these event mentions describe the same event from different aspects, they differ greatly from one another, and they are distributed across a large number of webpages. How to quickly and accurately find the event mentions that refer to the same event is a problem.

(2) How to take advantage of the additional information across event mentions and fuse many coreferent event mentions into a single event mention is another problem.

(3) Different Web events can share a common topic. How to accurately find the common topic of different events, extract topic terms and analyze the hot degree of those topic terms is also a problem.

This dissertation addresses Web data integration and focuses on the above problems. Its innovative work mainly includes the following aspects:

(1) Because Web events are unstructured data, an approach based on comprehensive dimension matching and a co-occurrence constraint is proposed for recognizing duplicate Web event mentions. An event is represented by eight dimensions: {agent, activity, object, time, location, cause, purpose, manner}. The dimension matching method is used to aggregate event mentions. Different matchers measure different dimensions, and an extended evidence theory is proposed to allocate dynamic weights and combine the dimension measurement results. An event co-occurrence constraint, which reduces the number of matches, is applied across multiple webpages during duplicate event mention recognition.
The experimental results demonstrate that this method can detect various duplicate event mentions and noticeably reduces the number of event mention matches.

(2) Because many different event mentions point to the same event, an approach based on dimension content recombination is proposed for fusing duplicate event mentions, that is, for event mention resolution. We use Markov Logic Networks to combine many first-order logic rules for choosing accurate and detailed dimension contents. We then combine the selected dimension contents and fuse them into one event mention. This unified event mention is detailed and accurate, so it can reflect the objective event. Experimental results show that the proposed approach selects accurate and detailed dimension contents and achieves high fusion accuracy for event mention resolution.

(3) Because different events can belong to a common topic, an approach to Web event topic analysis based on topic feature clustering and an extended LDA model is proposed to accurately analyze topics and their hot degree. The extended LDA model is dimension LDA (DLDA), which integrates the topic probability of LDA. We aggregate events that have a common topic by topic feature clustering, accurately detect the common topic among many different events, and analyze the topic terms of the events. Experimental results on the dataset show that the Web event topic analysis approach has high accuracy.
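To make the idea of combining per-dimension match results with evidence theory more concrete, the following is a minimal sketch in Python. It uses the classical Dempster rule of combination with a simple reliability discount, not the extended evidence theory proposed in the dissertation, whose details the abstract does not give. The function names, the three-way frame (match, nonmatch, unknown), and the similarity and reliability values are all assumptions made for illustration.

```python
from functools import reduce

# The eight event dimensions named in the abstract.
DIMENSIONS = ("agent", "activity", "object", "time",
              "location", "cause", "purpose", "manner")

# Mass function that carries no evidence at all (hypothetical helper constant).
VACUOUS = {"match": 0.0, "nonmatch": 0.0, "unknown": 1.0}


def to_mass(similarity, reliability):
    """Turn one matcher's similarity in [0, 1] into a mass function.

    Assumption: the matcher's reliability in [0, 1] acts as a discount,
    and the remaining mass is assigned to "unknown" (the whole frame).
    """
    return {
        "match": reliability * similarity,
        "nonmatch": reliability * (1.0 - similarity),
        "unknown": 1.0 - reliability,
    }


def combine(m1, m2):
    """Classical Dempster rule on the frame {match, nonmatch}."""
    conflict = m1["match"] * m2["nonmatch"] + m1["nonmatch"] * m2["match"]
    norm = 1.0 - conflict
    if norm <= 1e-12:
        raise ValueError("evidence is totally conflicting")
    return {
        "match": (m1["match"] * m2["match"]
                  + m1["match"] * m2["unknown"]
                  + m1["unknown"] * m2["match"]) / norm,
        "nonmatch": (m1["nonmatch"] * m2["nonmatch"]
                     + m1["nonmatch"] * m2["unknown"]
                     + m1["unknown"] * m2["nonmatch"]) / norm,
        "unknown": m1["unknown"] * m2["unknown"] / norm,
    }


def fuse_dimension_scores(scores, reliabilities):
    """Fuse the per-dimension similarities of two event mentions.

    Dimensions missing from `scores` contribute no evidence.
    """
    masses = [to_mass(scores[d], reliabilities[d])
              for d in DIMENSIONS if d in scores]
    return reduce(combine, masses, VACUOUS)
```

Under these assumptions, two event mentions could be treated as duplicates when the fused `match` mass exceeds a chosen threshold. A call such as `fuse_dimension_scores({"agent": 0.8, "time": 0.8}, {"agent": 0.5, "time": 0.5})` returns a mass function in which agreeing dimensions reinforce each other, so the fused `match` mass is higher than that of either dimension alone.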
Keywords/Search Tags: Duplicate event mention recognition, Event mention resolution, Dimension matching, Dimension content recombination, Topic analysis