Research On Chinese-Japanese News Corpus Construction Using Event Extraction

Posted on: 2017-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: J Yang
Full Text: PDF
GTID: 2348330512480140
Subject: Computer Science and Technology
Abstract/Summary:
With the development of statistical techniques, large-scale bilingual corpora have become an indispensable resource for machine translation and cross-language processing. A parallel bilingual corpus provides rich matching information between two languages, but high-quality, large-scale parallel corpora are usually difficult to acquire. One mainstream approach to cross-language information processing is to use bilingual corpora to construct translation equivalents, bilingual dictionaries, or bilingual named-entity equivalents in support of machine translation and cross-language information retrieval. However, existing bilingual corpus resources are usually severely lacking. In recent years, mining bilingual corpora from websites has become increasingly important. In particular, many high-quality multilingual news resources have appeared on the Internet. News articles are mainly narrative, and when an article is translated into different languages, the information on time, place, person, and organization must match strictly. It is therefore a good idea to use this information to construct a bilingual comparable corpus. Traditional methods of comparable corpus construction usually rely on Web structure information, similarity calculation, cross-language information retrieval, Wikipedia links, and similar resources.

In this paper, we propose a method for constructing a Chinese-Japanese news comparable corpus using event extraction technologies. First, we perform word segmentation and named entity recognition with a CRF model, and construct a named entity dictionary by named entity matching. We then collect Chinese and Japanese news with a web crawler and extract news feature sets using event extraction combined with a Japanese-Chinese dictionary, the named entity dictionary, and a Hanzi-Kanji mapping table of Japanese and Chinese characters. By calculating the similarity of the extracted news events, we realize a similarity detection method based on the features of Japanese-Chinese news events and generate bilingual document alignment results. Finally, we use these results to train a classifier model, which is used to identify document alignment for Japanese-Chinese news. Experimental results show that our method is effective and can overcome the shortcomings of traditional methods.
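
The event-similarity step described above can be illustrated with a minimal Python sketch. The abstract does not state which similarity measure, weights, or data layout the thesis uses, so everything here is an assumption made for illustration: the `EventFeatures` container, the `normalise_ja` helper, the per-field Jaccard overlap, and the equal field weights are all hypothetical. The sketch maps Japanese entities into Chinese using a Japanese-Chinese dictionary first and a character-level Kanji-to-Hanzi table as a fallback, then compares the time, place, person, and organization sets of two articles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventFeatures:
    """Hypothetical container for the entities extracted from one news article."""
    times: frozenset = frozenset()
    places: frozenset = frozenset()
    persons: frozenset = frozenset()
    organizations: frozenset = frozenset()


# Assumed equal weights; the thesis does not specify any weighting.
DEFAULT_WEIGHTS = {"times": 0.25, "places": 0.25,
                   "persons": 0.25, "organizations": 0.25}


def normalise_ja(entity, ja_zh_dict, kanji_to_hanzi):
    """Map a Japanese entity to a Chinese form.

    Whole-entity dictionary lookup comes first; otherwise each character
    is passed through the Kanji-to-Hanzi table (unknown characters are kept).
    """
    if entity in ja_zh_dict:
        return ja_zh_dict[entity]
    return "".join(kanji_to_hanzi.get(ch, ch) for ch in entity)


def jaccard(a, b):
    """Jaccard overlap of two sets; callers skip the case where both are empty."""
    return len(a & b) / len(a | b)


def event_similarity(zh, ja, ja_zh_dict, kanji_to_hanzi, weights=None):
    """Weighted average of per-field Jaccard overlaps between two articles.

    Fields that are empty in both articles carry no evidence and are left
    out of the average, so the score stays within [0, 1].
    """
    weights = weights or DEFAULT_WEIGHTS
    score = 0.0
    used_weight = 0.0
    for name, weight in weights.items():
        zh_set = frozenset(getattr(zh, name))
        ja_set = frozenset(
            normalise_ja(e, ja_zh_dict, kanji_to_hanzi) for e in getattr(ja, name)
        )
        if not zh_set and not ja_set:
            continue
        score += weight * jaccard(zh_set, ja_set)
        used_weight += weight
    return score / used_weight if used_weight else 0.0
```

In a pipeline like the one the abstract outlines, article pairs whose score exceeds some threshold could serve as candidate alignments for training the classifier; the threshold itself would have to be tuned on real data and is not given in the abstract.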
Keywords/Search Tags:Japanese-Chinese News, Document alignment, Event extraction, Named entities, Japanese-Chinese dictionary