Containing The Longest Noun Phrase Automatic Identification

Posted on:2008-01-06

Degree:Master

Type:Thesis

Country:China

Candidate:X F Qian

Full Text:PDF

GTID:2205360215954452

Subject:Linguistics and Applied Linguistics

Abstract/Summary:

Identifying the "Maximal Noun Phrase (MNP)" gives strong support to automatic parsing and machine translation systems. Earlier research focused on methods for identifying the boundaries of the phrase but lacked an in-depth study of the MNP itself. As Chinese grammar research shows, most modifier-core structures can be divided into agglutinating-style structures and assorted structures, according to whether "de" is present. Internally, because the "de-Phrase" exists, assorted structures with "de" can accept more parts of speech and more syntactic structures. Externally, the syntactic behavior of the two structures also differs. Chinese MNPs should therefore be divided into two types: MNPs that contain "de" (deMNP) and MNPs that do not. This paper first investigates the internal construction, the syntactic distribution, and the linear distribution of deMNP comprehensively, then proposes the strategy "identify the right boundary first, then identify the left one", and carries out further research on the identification of deMNP.

The paper has two parts. The first studies the automatic identification of deMNPs that contain a "de-Phrase" rather than a modifier-core structure with "de". This part analyses the features of the right and left boundaries of the phrase comprehensively and recognizes both boundaries with the "Boundary Distribution Probability" method. The second part studies the automatic identification of deMNPs that contain a modifier-core structure with "de". It also discusses the features of the phrase boundaries, and it transforms the phrase identification task into the task of recognizing the syntactic subject and the syntactic object. This part again uses the "Boundary Distribution Probability" method to recognize the right boundary, and it introduces a Collocation Model to recognize the left boundary. The model covers four collocation types: preposition frame, preposition-verb collocation, preposition-object collocation, and verb-object collocation.

The paper adopts two methods to address data sparseness. One is the "Compound Model", and the other is the "Training algorithm instructed by rules". The Compound Model clearly improves the model data by backing off to history equivalence classes such as conditional sub-probability, relevant frequency, and semantic class. Because the training corpus contains too few collocation items, the rule-instructed training algorithm extracts collocations directly from the test corpus using three rules, which raises recall by more than 27%.

A news corpus of about 0.64 million characters is used for training, and another of about 0.32 million characters is used for testing. The whole identification system achieves an F-score of about 70.42%. Within the strategy, right-boundary identification tags more than 91 percent of the targets, which effectively supports left-boundary identification; the latter achieves an F-score of about 76.16%. As the quality of the collocation data improves, the system is expected to perform better.
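
The abstract names the "Boundary Distribution Probability" method but does not define it. The Python sketch below shows one plausible reading, offered as an assumption rather than as the thesis's actual formulation: the probability that a part-of-speech tag marks the right boundary of a phrase is estimated as its relative frequency at that position in an annotated corpus. The function names, the tag labels, the 0.5 threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def boundary_probabilities(sentences):
    """Estimate P(right boundary | POS tag) as a relative frequency.

    `sentences` is a list of sentences; each sentence is a list of
    (pos_tag, is_right_boundary) pairs from an annotated corpus.
    This estimator is an assumption, not the thesis's definition.
    """
    tag_total = Counter()
    tag_boundary = Counter()
    for sentence in sentences:
        for tag, is_boundary in sentence:
            tag_total[tag] += 1
            if is_boundary:
                tag_boundary[tag] += 1
    return {tag: tag_boundary[tag] / tag_total[tag] for tag in tag_total}


def right_boundaries(tags, probs, threshold=0.5):
    """Return the indices whose tag's boundary probability meets the threshold."""
    return [i for i, tag in enumerate(tags) if probs.get(tag, 0.0) >= threshold]


# Toy, made-up annotations: "u_de" stands for the particle "de".
toy_corpus = [
    [("v", False), ("u_de", False), ("n", True)],
    [("n", False), ("u_de", False), ("n", True), ("v", False)],
]

probs = boundary_probabilities(toy_corpus)
candidates = right_boundaries(["v", "n", "u_de", "n"], probs)
print(probs)
print(candidates)
```

In a "right boundary first" strategy, the candidate positions found this way would then anchor a leftward search for the matching left boundary; the sketch stops at the first step and ignores context beyond the single tag.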

Keywords/Search Tags:

MNP, Automatic Identification, Boundary Distribution Probability, Collocation Model, Chinese Information Processing

Related items

1	The Automatic Identification Research Of Preposition "Dao" And Structure For Information Processing
2	A Study On The Automatic Acquisition Of Verb-Object Collocation For Chinese Language
3	In Modern Chinese Language Structure Of Automatic Identification
4	Information Processing-oriented Analysis On Preposition Phrase "Wang+X" And Automatic Identification In Computer
5	Modern Chinese Verb With More Than The Perspective Of Its Automatic Identification
6	Analysis And Study Of The Characteristics Of Chinese Three-part Causative Complexes Based On Relational Word Collocation
7	A Preliminary Study On Automatic Recognition Of Improper Chinese Collocation
8	A Study On Boundary Recognition Of Modern Chinese Prepositional Phrase
9	For Information Processing With A Typical Prefix Derived From The Word Recognition Analysis
10	Chinese As A Second Language Acquisition Study On The "V+O_Non-patient" Collocations