
Automatic Identification of the Longest Noun Phrases Containing "de"

Posted on: 2008-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: X F Qian
Full Text: PDF
GTID: 2205360215954452
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
Identifying the "Maximal Noun Phrase (MNP)" provides strong support for automatic parsing and machine translation systems. Previous research focused on approaches to identifying the boundaries of the phrase but lacked an in-depth study of the MNP itself. As research on Chinese grammar shows, most modifier-core structures can be divided into agglutinating-style structures and assorted structures, according to whether "de" is present. Internally, because the "de-Phrase" exists, assorted structures with "de" can accept more parts of speech and more syntactic structures. Externally, the syntactic behavior of the two structures also differs. Therefore, Chinese MNPs should be divided into two types: the MNP that contains "de" (deMNP) and the MNP that does not contain "de". This paper first comprehensively investigates the internal construction, the syntactic distribution, and the linear distribution of deMNP, then proposes a strategy of "Identify the right boundary first, then identify the left one", and carries out further research on the identification of deMNP.

This paper includes two parts. The first studies the automatic identification of the deMNP that contains a "de-Phrase" rather than a modifier-core structure with "de". In this part we comprehensively analyse the different features of the right and left boundaries of the phrase, and recognize the two boundaries by the method of "Boundary Distribution Probability". The second part studies the automatic identification of the deMNP that contains a modifier-core structure with "de". It also discusses the features of the phrase boundary, and it transforms the phrase identification task into that of recognizing the syntactic Subject and the syntactic Object. This part also adopts the "Boundary Distribution Probability" method to recognize the right boundary. Furthermore, we put forward a Collocation Model to recognize the left boundary.
This model covers four collocation types: preposition frame, preposition-verb collocation, preposition-object collocation, and verb-object collocation.

The paper adopts two methods to address the data sparseness problem. One is the "Compound Model", and the other is the "Training algorithm instructed by rules". The Compound Model clearly improves the model data by backing off to history equivalence classes, such as conditional sub-probability, relevant frequency, and semantic class. To compensate for the insufficient number of collocation items in the training corpus, the training algorithm instructed by rules extracts collocations directly from the test corpus using three rules, which raises the recall rate by more than 27%.

A news corpus of about 0.64 million characters is used for training, and another of about 0.32 million characters is used for testing. The whole identification system achieves an F-score of about 70.42%. In terms of strategy, the identification of the right boundary tags more than 91 percent of the objects, which effectively supports the identification of the left one. The latter achieves an F-score of about 76.16%. As the quality of the collocation data improves, the system can be expected to perform better.
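The abstract does not give the formula behind "Boundary Distribution Probability" or the exact back-off order of the Compound Model, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the probability is a relative frequency (how often a token ends an MNP in annotated training data) and that a rarely seen word backs off to its part-of-speech class. The function names, the `min_count` threshold, the tag set, and the toy data are all hypothetical and are not taken from the thesis.

```python
from collections import Counter

def train_boundary_counts(sentences):
    """Count how often each word and each POS tag ends an MNP.

    `sentences` is a list of sentences; each sentence is a list of
    (word, pos, is_right_boundary) triples. This annotation format is
    an assumption made for illustration only.
    """
    word_total, word_hits = Counter(), Counter()
    pos_total, pos_hits = Counter(), Counter()
    for sentence in sentences:
        for word, pos, is_boundary in sentence:
            word_total[word] += 1
            pos_total[pos] += 1
            if is_boundary:
                word_hits[word] += 1
                pos_hits[pos] += 1
    return word_total, word_hits, pos_total, pos_hits

def boundary_probability(word, pos, counts, min_count=2):
    """Relative-frequency estimate that (word, pos) is a right boundary.

    If the word was seen fewer than `min_count` times, back off to the
    coarser POS class; if the POS tag is unseen too, return 0.0.
    """
    word_total, word_hits, pos_total, pos_hits = counts
    if word_total[word] >= min_count:
        return word_hits[word] / word_total[word]
    if pos_total[pos] > 0:
        return pos_hits[pos] / pos_total[pos]
    return 0.0

# Hypothetical toy corpus in pinyin; True marks the last token of an MNP.
toy_corpus = [
    [("hong", "a", False), ("de", "u", False), ("shu", "n", True)],
    [("xin", "a", False), ("de", "u", False), ("shu", "n", True)],
    [("shu", "n", False), ("hen", "d", False), ("hao", "a", False)],
    [("da", "a", False), ("de", "u", False), ("che", "n", True)],
]
counts = train_boundary_counts(toy_corpus)

print(boundary_probability("shu", "n", counts))  # word-level estimate
print(boundary_probability("che", "n", counts))  # backs off to POS "n"
```

Backing off from a sparse word-level estimate to a denser class-level one is a common way to handle data sparseness. The thesis names other equivalence classes as well (for example semantic class), which would slot in as further back-off levels in a sketch like this.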
Keywords/Search Tags: MNP, Automatic Identification, Boundary Distribution Probability, Collocation Model, Chinese Information Processing