Research Of Chinese Phrase Identification Based On Conditional Random Fields

Posted on: 2009-01-18  Degree: Master  Type: Thesis
Country: China  Candidate: Y S Guo  Full Text: PDF
GTID: 2178360308979368  Subject: Computer application technology
Abstract/Summary:
With the development of machine translation technology, the requirements placed on full parsing are becoming more and more demanding. Parsing is a basic technique in natural language processing; however, a full parser is usually costly and slow. Recently, phrase identification has been applied to various information processing systems. Compared with a full parser, a phrase identifier is much faster, and its output is more useful for various applications, such as information retrieval and extraction, question answering, and automatic document summarization.

Based on Abney's definition of the English chunk, the linguistic characteristics of Chinese, and the annotation scheme of the Penn Chinese Treebank 5.1, the author defines a Chinese chunk as a single-semantic, non-recursive core of an intra-clausal constituent, with the restriction that no chunk is contained in another chunk. Guided by this definition, the training and test data are extracted from the Penn Chinese Treebank 5.1. The thesis presents the basic principles of conditional random fields (CRFs). Compared with other traditional statistical language models in theory and practice, CRFs are well suited to sequence labeling tasks and perform excellently. A phrase identification system was designed and implemented, and the phrases in the test corpus were identified.

A series of experiments shows that performance improves very quickly as the training data grows to twice the size of the test data; beyond that point the gains become smaller, although the trend is still upward. We therefore conclude that a larger corpus improves the recognition of Chinese phrases. In addition, the second-order CRF does not perform better than the first-order CRF. The reason is that the second-order CRF uses more context features, which makes the data sparseness problem serious. The thesis proposes a novel method for Chinese phrase identification that simplifies the task of full parsing and favors the rapid application of full parsing in large-scale real-world document processing systems.
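
The abstract treats phrase identification as a sequence labeling task over chunks that never nest. A minimal sketch of how such chunks can be turned into one label per token is shown below. It assumes the common BIO encoding, hypothetical chunk types (NP, VP), and a made-up segmented sentence; the thesis does not state its actual tag set or encoding, so none of these names should be read as the author's.

```python
from typing import List, Tuple

# A chunk is (start, end_exclusive, chunk_type); the types are illustrative only.
Chunk = Tuple[int, int, str]


def chunks_to_bio(n_tokens: int, chunks: List[Chunk]) -> List[str]:
    """Encode non-overlapping chunk spans as one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in sorted(chunks):
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        # Enforce the restriction that no chunk is contained in another chunk.
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("chunks must not overlap or nest")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical segmented sentence, not taken from the treebank.
tokens = ["我", "喜欢", "自然", "语言", "处理"]
chunks = [(0, 1, "NP"), (1, 2, "VP"), (2, 5, "NP")]
bio_tags = chunks_to_bio(len(tokens), chunks)
for token, tag in zip(tokens, bio_tags):
    print(token, tag)
```

With labels in this form, a first-order CRF conditions each tag on the previous tag, while a second-order CRF conditions on the previous two, which multiplies the number of label-context features and is consistent with the data sparseness noted above.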
Keywords/Search Tags: Phrase Identification, Syntactic Parsing, Conditional Random Fields, Machine Translation