Chinese Multiword Expression Extraction

Posted on: 2014-10-09  Degree: Master  Type: Thesis
Country: China  Candidate: Y Di  Full Text: PDF
GTID: 2268330401469476  Subject: Computer application technology
Abstract/Summary:
Multiword Expressions (MWEs) are relatively complete semantic units consisting of two or more words that are syntactically and semantically associated. MWEs are one of the most intractable problems in Natural Language Processing: they increase segmentation and semantic errors and thereby degrade the performance of practical applications such as Machine Translation and Parsing. "Verb+Noun" and "Gerund+Noun" expressions are the most common types of Chinese MWE. This thesis therefore takes Chinese "verb+noun" and "gerund+noun" MWEs as its object of study and covers both MWE extraction and application. The main contents of this research are as follows:

1. Extracting MWE candidates from a sentence-aligned Chinese-English bilingual corpus. Boundaries in an Indo-European language such as English can help determine the boundaries of complete semantic units in Chinese. We first align the sentence-aligned corpus with the phrase alignment tool Moses, then extract MWE candidates from the aligned output, and finally compare the results with those obtained using the word alignment tool GIZA++. On the Peking University Chinese-English aligned corpus, the F-measure is 55.95% with phrase alignment and 45.32% with word alignment.

2. Extracting MWE candidates by syntactic analysis. Parsing identifies the structure of a sentence, so processing can reach the internal structure of the language. We parse the Chinese side of the Peking University aligned corpus with three parsers (the HIT parser, the Berkeley parser, and the Stanford parser) and extract word pairs linked by specific dependency relations as MWE candidates. The accuracy is 42.40% for HIT, 41.00% for Berkeley, and 39.73% for Stanford.

3. Constructing a knowledge base of Chinese "verb+noun" and "gerund+noun" MWEs. Construction has two steps: statistical filtering and classification. Statistical methods handle large-scale corpora conveniently and do not depend on a specific domain. This thesis filters MWE candidates with mutual information as the internal measure and C-value as the external measure. The remaining candidates are classified as VOB (verb-object) or ATT (attributive) to build a classified knowledge base, which is important for subsequent research.

4. Parsing error correction. The automatically extracted Chinese MWEs are used to correct parsing errors in the output of the HIT parser. The resulting error correction rate is 98.87% for ATT constructions and 99.98% for VOB constructions.
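
The statistical filtering step described in the third item can be sketched as follows. The abstract does not give the exact formulas or thresholds, so this is only an illustration under stated assumptions: it uses the standard pointwise mutual information for a two-word candidate and the usual C-value definition (frequency discounted by the average frequency of longer candidates that contain the candidate). The function names, the pinyin toy data, and the thresholds `min_pmi` and `min_c` are hypothetical.

```python
import math


def pmi(pair_count, left_count, right_count, total_pairs):
    """Pointwise mutual information (base 2) of a two-word candidate."""
    p_xy = pair_count / total_pairs
    p_x = left_count / total_pairs
    p_y = right_count / total_pairs
    return math.log2(p_xy / (p_x * p_y))


def c_value(candidate, freq):
    """C-value of a candidate (a tuple of words).

    freq maps candidate tuples to corpus frequencies. Longer candidates
    that contain `candidate` as a contiguous span discount its frequency.
    """
    n = len(candidate)
    longer = [
        c for c in freq
        if len(c) > n
        and any(c[i:i + n] == candidate for i in range(len(c) - n + 1))
    ]
    f = freq[candidate]
    if longer:
        f -= sum(freq[c] for c in longer) / len(longer)
    return math.log2(n) * f


def filter_candidates(pair_stats, freq, total_pairs, min_pmi=1.0, min_c=2.0):
    """Keep two-word candidates that pass both (assumed) thresholds.

    pair_stats maps a candidate to (count of its first word, count of its
    second word); both thresholds are illustrative, not from the thesis.
    """
    kept = []
    for cand, (left_count, right_count) in pair_stats.items():
        if (pmi(freq[cand], left_count, right_count, total_pairs) >= min_pmi
                and c_value(cand, freq) >= min_c):
            kept.append(cand)
    return kept


# Toy usage with invented counts (not data from the thesis).
toy_freq = {("da", "dianhua"): 8, ("da", "dianhua", "shijian"): 2}
toy_stats = {("da", "dianhua"): (16, 8)}
print(filter_candidates(toy_stats, toy_freq, total_pairs=64))
```

Mutual information measures how strongly the two words of a candidate attract each other internally, while C-value looks at the candidate's external context by penalizing candidates that mostly occur inside longer expressions, which is why the two measures complement each other in a filter like this.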
Keywords/Search Tags: Multiword Expression, MWE Extraction, Phrase Alignment, MWE Classification, Parsing Error Correction