Chinese Multiword Expression Extraction

Posted on:2014-10-09

Degree:Master

Type:Thesis

Country:China

Candidate:Y Di

Full Text:PDF

GTID:2268330401469476

Subject:Computer application technology

Abstract/Summary:

Multiword expressions (MWEs) are relatively complete semantic units consisting of two or more words that are syntactically and semantically associated. They are one of the most intractable problems in natural language processing: they increase segmentation and semantic errors and thereby degrade the performance of practical applications such as machine translation and parsing. "Verb+noun" and "gerund+noun" expressions are the most common types of Chinese MWE. This thesis therefore takes Chinese "verb+noun" and "gerund+noun" MWEs as its object of study and covers both their extraction and their application. The main contents of this research are as follows:

1. Extracting MWE candidates from a sentence-aligned Chinese-English bilingual corpus. Boundaries in an Indo-European language can help determine the boundaries of complete semantic units in Chinese. We first align the sentence-aligned corpus with the phrase alignment tool Moses, then extract MWE candidates from the aligned output, and finally compare the result with candidates obtained using the word alignment tool GIZA++. On the Peking University Chinese-English aligned corpus, the F-measure is 55.95% with phrase alignment and 45.32% with word alignment.

2. Extracting MWE candidates by syntactic analysis. Parsing identifies the structure of a sentence, so that processing can reach the internal structure of the language. We parse the Chinese side of the Peking University aligned corpus with three parsers (the HIT parser, the Berkeley parser, and the Stanford parser) and then extract words linked by specific dependency relations as MWE candidates. The accuracy is 42.40% for HIT, 41.00% for Berkeley, and 39.73% for Stanford.

3. Constructing a knowledge base of Chinese "verb+noun" and "gerund+noun" MWEs. Construction has two steps: statistical filtering and classification. Statistical methods handle large-scale corpora conveniently and do not depend on a specific domain. We filter MWE candidates using mutual information as the internal measure and C-value as the external measure. The classification step divides candidates into VOB (verb-object) and ATT (attributive) to build a classified knowledge base, which is important for subsequent research.

4. Parsing error correction. We use the automatically extracted Chinese MWEs to correct parsing errors in the output of the HIT parser. The error correction rate of the HIT parser with our Chinese MWEs is 98.87% for ATT constructions and 99.98% for VOB constructions.
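
The statistical filtering step names two measures, mutual information and C-value, but the abstract does not give the formulas used. The sketch below assumes the common textbook forms: pointwise mutual information for a two-word candidate, and the C-value of Frantzi et al., which discounts a candidate's frequency by the average frequency of the longer candidates that contain it. The function names, the candidate tuples, and all counts are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
import math


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (in bits) of a two-word candidate.

    Assumed form: log2( P(w1, w2) / (P(w1) * P(w2)) ), with all
    probabilities estimated as count / total.
    """
    p_xy = pair_count / total
    p_x = w1_count / total
    p_y = w2_count / total
    return math.log2(p_xy / (p_x * p_y))


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(candidate, freq):
    """C-value of `candidate`, a tuple of words.

    `freq` maps every candidate tuple to its corpus frequency.
    Assumed form (Frantzi et al.):
      not nested: log2(|a|) * f(a)
      nested:     log2(|a|) * (f(a) - mean frequency of longer candidates
                               that contain a)
    """
    weight = math.log2(len(candidate))
    longer = [c for c in freq
              if len(c) > len(candidate) and contains(c, candidate)]
    if not longer:
        return weight * freq[candidate]
    nested_mean = sum(freq[c] for c in longer) / len(longer)
    return weight * (freq[candidate] - nested_mean)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(pmi(pair_count=2, w1_count=2, w2_count=2, total=8))
    freq = {
        ("打", "电话"): 10,
        ("打", "电话", "给"): 2,
        ("给", "他", "打", "电话"): 4,
    }
    print(c_value(("打", "电话"), freq))
```

Under these assumptions, a candidate would be kept when both scores exceed thresholds tuned on held-out data; the abstract does not state which thresholds the thesis used.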

Keywords/Search Tags:

Multiword Expression, MWE Extraction, Phrase Alignment, MWE Classification, Parsing Error Correction

Related items

1	Chinese Multiword Expression Extraction And Application On Chinese Dependency Parsing
2	Multiword Expressions: Extraction And Applications
3	Rule-based And Statistical-based Combination Of Bilingual Parallel Sentence, The Phrase Alignment Method
4	Hierarchical Multiword Expression-Based Text Matching Research
5	Studies On The Usage Of Preposition And Conjunction In Phrase Structure Syntactic Parsing
6	Study On Several Key Problems In The Training Process Of Phrase-based Statistical Machine Translation
7	Chinese Prepositional Phrase Recognition Based On Fine-grained Phrase Information
8	Research On Chinese Phrase Structure Syntactic Parsing Based On CVG
9	Research On Chinese Text Error Correction Based On N-gram And Dependency Parsing
10	Research Of Dynamic Facial Expression Recognition Technology