
Open Domain Event Extraction From Microblog

Posted on: 2014-10-03  Degree: Master  Type: Thesis
Country: China  Candidate: J J Gao  Full Text: PDF
GTID: 2268330392462825  Subject: Software engineering
Abstract/Summary:
A growing number of people use social networks such as Facebook and Twitter to share their status, and these status messages contain a huge amount of information about users' daily lives and the topics currently drawing the most public attention. By analyzing these messages and extracting information from them, we can reorganize their content into a structured representation. Event information extracted by automatic or semi-automatic methods helps people catch breaking events faster and more reliably, especially emergencies; it also provides leads for journalists and auxiliary information for online public opinion prediction.
Status messages on microblogs often describe real-world events that attracted particular attention. However, microblog information is scattered and redundant, which prevents us from obtaining accurate and complete event information. Event information in microblogs consists of Named Entities and Event Phrases, where an Event Phrase describes the status and behavior of a Named Entity. We can therefore obtain event information from microblogs by capturing Named Entities and Event Phrases, which makes identifying them the core of the whole task.
Traditionally, information extraction focuses on specific domains and patterns; shifting to a new domain requires manually creating new extraction rules or hand-tagging new training examples. Because microblog platforms are open, users can publish status messages anytime and anywhere, so traditional event extraction methods perform poorly on microblogs. Event extraction has accordingly been moving from traditional, domain-specific extraction toward open domain event extraction. The advantage of open domain event extraction is that the system makes a single data-driven pass over the data, and shifting to a new domain requires no new rules or training examples.
The representative English open domain event extraction system is TWICAL, developed by the Artificial Intelligence Group at the University of Washington. Because Chinese has no separators between words, word ambiguity degrades Chinese word segmentation. In addition, the structures of Named Entities and Event Phrases are complicated and varied. In short, the particular characteristics of Chinese make event extraction a challenging task, and so far there is no mature Chinese open domain event extraction system.
In this paper we present EventCalendar, an open domain event extraction system for Chinese. The system produces an open domain calendar of significant events extracted from microblogs. We treat event extraction as a sequence labeling problem and successfully apply a CRF (conditional random field) model to the microblog event extraction task. First, microblog messages are tokenized and tagged with part-of-speech (POS) tags using NLPIR, a Chinese word segmentation system released by Dr. Zhang Huaping. Named Entities and Event Phrases are then extracted with the CRF sequence labeling model. In parallel, temporal expressions are resolved using regular expressions, after which the extracted events are categorized into types. Finally, to determine whether an event is significant, we measure the association between a Named Entity and a specific date based on the number of times they co-occur. Significant events then appear on the calendar.
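
The last stages of the pipeline described above (collecting labeled spans, resolving dates with regular expressions, and counting entity-date co-occurrences) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the `B-ENT`/`I-ENT` and `B-EVT`/`I-EVT` tag names, the single date pattern, the function names, and the use of raw co-occurrence counts as the association score. The tags are supplied by hand here; in the described system they would come from the CRF model.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical pattern: absolute month-day expressions such as "10月9日".
# A real system would need many more patterns (relative dates, weekdays, ...).
DATE_RE = re.compile(r"(\d{1,2})月(\d{1,2})[日号]")


def resolve_dates(text, year):
    """Return the calendar dates mentioned in text, assuming the given year."""
    found = []
    for match in DATE_RE.finditer(text):
        month, day = int(match.group(1)), int(match.group(2))
        try:
            found.append(date(year, month, day))
        except ValueError:
            continue  # skip impossible dates such as "13月40日"
    return found


def bio_spans(tokens, tags, label):
    """Join tokens tagged B-<label> / I-<label> into text spans."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            if current:
                spans.append("".join(current))
            current = [token]
        elif tag == f"I-{label}" and current:
            current.append(token)
        else:
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


def cooccurrence_counts(messages, year):
    """Count how many messages mention each (entity, date) pair.

    messages is a list of (tokens, tags) pairs. Raw counts stand in for the
    association measure, whose exact form the abstract does not specify.
    """
    counts = Counter()
    for tokens, tags in messages:
        entities = set(bio_spans(tokens, tags, "ENT"))
        dates = set(resolve_dates("".join(tokens), year))
        for entity in entities:
            for day in dates:
                counts[(entity, day)] += 1
    return counts
```

Pairs whose count exceeds some threshold would then be candidates for the calendar; how that threshold is chosen is left open here.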
Keywords/Search Tags: Event Extraction, Open Domain, Part of Speech, Named Entity Extraction, Event Phrase Extraction