
Research On Web Entity Event Duplication Detection

Posted on: 2015-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: L L Wang
Full Text: PDF
GTID: 2268330431455461
Subject: Computer software and theory
Abstract/Summary:
With network technology changing rapidly, the amount of information on the Web is growing quickly, and the Web has become a huge source of data. This data contains many valuable entity events that play important roles in people's daily work and in social production. If fully mined and analyzed, the entity events on the Web can provide a wealth of knowledge, which is of great significance for market intelligence analysis, public opinion analysis, e-commerce, business intelligence, and other areas. However, the Web is a free and open space: Web entity events come from different data sources, the data sources are strongly autonomous, and information is published with relatively few constraints. Moreover, natural language is free and flexible, so it is very common for the same entity event to be expressed in different ways. This makes the discovery and analysis of entity events much harder, and it also troubles users searching for information as well as decision-makers.

To give users concise, accurate, and non-repetitive entity event information, duplication detection over entity event representations from different data sources is essential; it is also an important subtask of Web entity event discovery. To achieve this goal, the following two key problems need to be solved.

(1) Duplication detection of entity event representations. This means performing duplication detection on the entity event representations from different data sources and identifying the different representations of the same entity event, that is, the representations that have the same meaning but are expressed differently.

(2) Duplication detection of entity events. An entity event is represented by the set of its duplicate entity event representations.
After duplication detection of entity event representations, duplication may still exist among the entity events themselves, so entity event duplication detection is still needed.

In this paper, different entity event representations and the relationships between entity events are studied. Aiming at duplication detection of Web entity events, this paper addresses the two key problems above: entity event representation duplication detection and entity event duplication detection. The main work of this paper is as follows:

(1) For entity event representation duplication, this paper relies on the rule, taken from the commercial domain, that a given subject can attend only one activity at the same time and location, and proposes "a method based on linear combination with dynamic weight" for event duplication detection. The method first calculates the similarity scores of the three main attributes (time, location, and subject) and the similarity score of the other auxiliary attributes for an entity event representation pair; at the same time, it calculates the dynamic weights. It then combines the attribute similarity scores with the dynamic weights to obtain the similarity score of the representation pair. Finally, it compares this score with a specific threshold to judge whether the pair is duplicated. Experimental results show that the method improves the F-measure significantly and solves the problem of entity event duplication effectively.

(2) An entity event is represented by the set of its duplicate entity event representations, and duplication may exist among entity events. Therefore, building on "the method based on linear combination with dynamic weight", this paper further proposes two methods, "the method based on entity event attributes" and "the method based on entity event relationships", to solve the problem of entity event duplication.
The first method is direct: it compares the entity events that need duplication detection with each other. The second method builds on the first and is indirect: according to the relationships among entity events, it compares the sets of events related to the entity events under detection rather than those entity events themselves, and from this comparison it obtains a relation similarity score for the entity events.
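
The following Python sketch illustrates the general shape of the two ideas described above: a weighted linear combination of attribute similarities compared against a threshold, and a relation similarity computed over sets of related events. It is not the thesis's implementation. The abstract does not say how attribute similarities or dynamic weights are computed, so everything concrete here is an assumption made for illustration: token-level Jaccard similarity, the `BASE_WEIGHTS` values, renormalizing the weights over the attributes present in both representations as a stand-in for "dynamic weight", the 0.8 threshold, and all names (`Representation`, `pair_similarity`, `is_duplicate`, `relation_similarity`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Representation:
    """One textual representation of an entity event (hypothetical schema)."""
    time: Optional[str] = None
    location: Optional[str] = None
    subject: Optional[str] = None
    other: Optional[str] = None  # auxiliary attributes, flattened to text


# Assumed base weights; the thesis does not publish its values in the abstract.
BASE_WEIGHTS = {"time": 0.3, "location": 0.3, "subject": 0.3, "other": 0.1}


def jaccard(a, b):
    """Jaccard similarity of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def attr_sim(x, y):
    """Token-level similarity of one attribute, or None if either side is missing."""
    if x is None or y is None:
        return None
    return jaccard(x.lower().split(), y.lower().split())


def pair_similarity(r1, r2):
    """Linear combination of attribute similarities.

    Assumption: the weights are "dynamic" only in the sense that they are
    renormalized over the attributes available in both representations.
    """
    sims = {name: attr_sim(getattr(r1, name), getattr(r2, name))
            for name in BASE_WEIGHTS}
    present = {name: s for name, s in sims.items() if s is not None}
    if not present:
        return 0.0
    total = sum(BASE_WEIGHTS[name] for name in present)
    return sum(BASE_WEIGHTS[name] / total * s for name, s in present.items())


def is_duplicate(r1, r2, threshold=0.8):
    """Threshold decision on the combined score (threshold value is assumed)."""
    return pair_similarity(r1, r2) >= threshold


def relation_similarity(related_a, related_b):
    """Indirect comparison: similarity of the sets of events related to each event."""
    return jaccard(related_a, related_b)


if __name__ == "__main__":
    r1 = Representation("2014-05-01", "Beijing", "Acme Corp", "product launch")
    r2 = Representation("2014-05-01", "Beijing", "Acme Corp")
    r3 = Representation("2014-05-02", "Shanghai", "Other Inc")
    print(is_duplicate(r1, r2), is_duplicate(r1, r3))
    print(relation_similarity({"e1", "e2", "e3"}, {"e2", "e3", "e4"}))
```

In this sketch a missing auxiliary attribute does not penalize a pair that agrees on time, location, and subject, which mirrors the stated rule that a subject can attend only one activity at a given time and place. A real system would likely need attribute-specific similarity functions, for example date normalization for time and gazetteer matching for location.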
Keywords/Search Tags: Web Entity Event, Duplication Detection of Entity Event, Dynamic Weight, Entity Event Relationship