Research Of Chinese Phrase Identification Based On Conditional Random Fields

Posted on:2009-01-18

Degree:Master

Type:Thesis

Country:China

Candidate:Y S Guo

Full Text:PDF

GTID:2178360308979368

Subject:Computer application technology

Abstract/Summary:

As machine translation technology develops, the demands placed on full parsing keep growing. Parsing is a basic technique in natural language processing; however, a full parser is usually costly and slow. Recently, phrase identification has been applied to various information processing systems. Compared with a full parser, a phrase identifier is much faster, and its output is more useful for applications such as information retrieval and extraction, question answering, and automatic document summarization.

Based on Abney's definition of the English chunk, the linguistic characteristics of Chinese, and the annotation scheme of the Penn Chinese Treebank 5.1, the author defines a Chinese chunk as a single semantic, non-recursive core of an intra-clausal constituent, with the restriction that no chunk is contained in another chunk. Following this definition, the training and test data are extracted from the Penn Chinese Treebank 5.1.

The thesis presents the basic principles of conditional random fields (CRFs). Compared with other traditional statistical language models, in both theory and practice, CRFs are well suited to sequence labeling tasks and perform excellently. A phrase identification system was designed and implemented, and the phrases in the test corpus were identified.

A series of experiments shows that performance improves rapidly as the training data grows to twice the size of the test data; beyond that point the gains become smaller, although the trend remains upward. The author therefore concludes that a larger corpus improves the recognition of Chinese phrases. In addition, the second-order CRF does not perform better than the first-order CRF. The reason is that the second-order CRF uses more context features, which makes the data sparseness problem serious.

The thesis proposes a novel method for Chinese phrase identification that simplifies the task of full parsing and favors the rapid application of full parsing in large-scale, real-world document processing systems.
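
The abstract does not say how chunks are encoded for the CRF. One common way to cast non-overlapping, non-recursive chunks as a sequence labeling problem is BIO tagging, and the following is a minimal sketch under that assumption. The function names, the NP/VP labels, and the example sentence are illustrative and are not taken from the thesis.

```python
from typing import List, Tuple

# A chunk is (start, end_exclusive, label) over token positions.
Chunk = Tuple[int, int, str]


def chunks_to_bio(n_tokens: int, chunks: List[Chunk]) -> List[str]:
    """Encode non-overlapping chunks as one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in sorted(chunks):
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"chunk out of range: {(start, end, label)}")
        # Enforce the restriction that no chunk lies inside another chunk.
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("chunks must not overlap or nest")
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def bio_to_chunks(tags: List[str]) -> List[Chunk]:
    """Decode a BIO tag sequence back into (start, end, label) chunks."""
    chunks: List[Chunk] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes the last chunk
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            chunks.append((start, i, label))
            start, label = None, None
        # A stray "I-" with no open chunk is leniently treated as a chunk start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return chunks


# Hypothetical segmented sentence: "I / like / natural / language / processing".
tokens = ["我", "喜欢", "自然", "语言", "处理"]
gold = [(0, 1, "NP"), (1, 2, "VP"), (2, 5, "NP")]
tags = chunks_to_bio(len(tokens), gold)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Under this encoding a first-order linear-chain CRF scores pairs of adjacent tags, while a second-order one scores triples, so the number of possible label transitions grows from the square to the cube of the tag set size. That growth is consistent with the data sparseness the abstract reports for the second-order model.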

Keywords/Search Tags:

Phrase Identification, Syntactic Parsing, Conditional Random Fields, Machine Translation

Related items

1	Research On Chinese Syntactic Parsing Based On Cascaded Conditional Random Fields
2	Research On Chinese Prepositional Phrase Identification Based On Multi-layer Conditional Random Fields
3	Research On Key Technologies Of English-Chinese Machine Translation System
4	Research On Chinese Preposition Phrase Identification Based On Cascaded Conditional Random Fields
5	Machine Translation Based On Phrase Template
6	Studies On The Usage Of Preposition And Conjunction In Phrase Structure Syntactic Parsing
7	Automatic Recognition And Parsing Of Chinese Maximal-Length Noun Phrase
8	The Research On Bilingual Syntactic Phrase-based Statistical Machine Translation
9	Research Of Phrase-based Translation Model Using Syntactic And Morphologic Information
10	Chunk Based Chinese Syntactic Parsing And Its Application