
Research On Keeping Consistency Of Chinese Corpus Of Complete Parsing

Posted on: 2012-11-05
Degree: Master
Type: Thesis
Country: China
Candidate: L Wei
Full Text: PDF
GTID: 2218330368989246
Subject: Computer software and theory
Abstract/Summary:
Parsing is a key technology in natural language processing and is important for semantic analysis, machine translation, information retrieval, and automatic summarization. Parsing analyzes the structure of a sentence and the grammatical functions of its parts; its purpose is to determine the sentence structure and the relationships among its constituents. Building statistical models from a treebank is an important research direction in parsing, and the quality of a parsing model is directly influenced by the annotation quality of the treebank. Because the accuracy of many current automatic parsing algorithms for Chinese is not high, the treebank must be corrected manually and the result must then be checked for consistency. Our consistency check relies on multiple annotators: the corrected results are re-checked and cross-checked to minimize omissions in manual annotation. The whole treebank is then scanned to detect inconsistent annotation, that is, different annotations of identical or similar linguistic contexts within the treebank.

Based on the construction of the Ali Chinese parsing Treebank, we explore the causes of inconsistent annotation and strategies for resolving it, and we correct 20,000 sentences. The main work is as follows:

(1) Formulate a treebank annotation standard according to the application requirements. A complete standard helps ensure the consistency of the annotated results. The standard has two major parts: the annotation method and the annotation sets. The annotation method describes how to annotate the structural relationships between the words in a sentence. Annotators correct the treebank by referring to the specific annotated examples given in the standard.

(2) Analyze the causes of inconsistency and their solutions. One cause is that the standard itself is imperfect; we discuss the problems that arise during annotation and revise the standard continually. The other cause is omissions in manual annotation, so the corrected results need a consistency check.

(3) Propose checking methods and resolution strategies at three layers: word segmentation, POS tagging, and parsing structure. The three layers interact with one another. First, we scan the treebank and use a rule-based method to check for word segmentation inconsistency. Then we use a clustering method to check for POS tagging inconsistency.

(4) Check parsing inconsistency in terms of single-layer parsing annotation and multi-layer structure preference relations. We use an error-driven, rule-based method to check single-layer syntactic function and structural relation annotation. We check multi-layer parsing inconsistency using structure preference relations: we build a support vector machine model of the linguistic context and determine the most appropriate syntactic structure according to that context.

Experiments show that word segmentation inconsistency and POS tagging inconsistency significantly influence parsing results. After the word segmentation and POS tagging consistency checks, the number of inconsistencies decreases considerably. Finding and correcting parsing inconsistencies is the difficult part of the consistency check. Through combined manual and automatic consistency checking, the accuracy of the treebank has been improved. The checking method is based on rules and statistical theory. When we applied it to the Ali Treebank, the consistency check achieved a precision of 78.2% and a recall of 90.1%, and it can improve the accuracy of treebank annotation by 3%.
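The abstract does not spell out the rules used for the word segmentation check, so the following Python fragment is only a minimal sketch of one plausible rule, not the thesis's method: it scans a segmented corpus and flags every character string that appears with more than one segmentation. The function name `find_segmentation_variants`, the `max_words` window, and the toy two-sentence corpus are all assumptions made for illustration.

```python
from collections import defaultdict


def find_segmentation_variants(sentences, max_words=3):
    """Flag character strings that are segmented in more than one way.

    `sentences` is a list of already-segmented sentences (lists of words).
    Every run of up to `max_words` adjacent words is joined back into its
    surface string; a string reached through different word sequences is a
    candidate inconsistency. (Hypothetical helper, not from the thesis.)
    """
    variants = defaultdict(set)
    for words in sentences:
        for start in range(len(words)):
            for length in range(1, max_words + 1):
                if start + length > len(words):
                    break
                span = tuple(words[start:start + length])
                variants["".join(span)].add(span)
    # Keep only strings observed with at least two distinct segmentations.
    return {s: segs for s, segs in variants.items() if len(segs) > 1}


# Toy corpus (assumed data): the same string is split in one sentence
# and kept as a single word in the other.
treebank = [["中国", "人民"], ["中国人民", "站", "起来"]]
flagged = find_segmentation_variants(treebank)
```

A scan like this only proposes candidates: some strings are legitimately segmented differently in different contexts, so each flagged item would still need manual review, in line with the combined manual and automatic checking described above.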
Keywords/Search Tags:Chinese Information Processing, Corpus, Complete Parsing, Consistency