
Research On Chinese Sentence Compression

Posted on: 2015-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: X Jiang
Full Text: PDF
GTID: 2348330473953711
Subject: Computer application technology
Abstract/Summary:
In recent years, with the wide application and rapid development of computers and the Internet, natural language processing has entered an era of rapid development. At the same time, how to obtain valuable information from massive data quickly and accurately is attracting growing interest from researchers. As a basic technology for addressing this problem, sentence compression has great practical value. It can be applied to automatic summarization, question answering, machine translation, and many other tasks.

In this thesis, a specification for Chinese sentence compression is proposed. According to this specification, we develop a manually annotated corpus for Chinese sentence compression. Using this corpus, we construct an automatic compression system for Chinese and evaluate its output with human and automatic metrics.

The main contributions of this thesis are as follows:

(1) Mainstream research on sentence compression focuses on supervised learning, but the task lacks a large-scale parallel corpus. We therefore propose an annotation specification for a Chinese sentence compression corpus, based on the structure of the Chinese language. Guided by this specification, we construct a corpus, NEUCSS (3308 parallel sentence pairs). NEUCSS is the first manually annotated corpus for the Chinese sentence compression task and provides a data foundation for future work in this field. In addition, this thesis describes the whole corpus annotation process and the quality control method.

(2) This thesis builds an automatic Chinese sentence compression system using the NEUCSS corpus. The system is based on synchronous tree substitution grammar. Synchronous rules are extracted from parallel syntactic trees generated in the pre-processing module. Model parameters are learned with a structured SVM, and the decoding module uses these parameters to obtain the final compression results.

(3) This thesis studies evaluation metrics for the Chinese sentence compression task: human judgments and automatic evaluation metrics. Because human judgments are accurate and reliable, most previous work relies on them; accordingly, this thesis evaluates the compression results with human judgments of grammaticality and importance. However, human judgments are costly, so this thesis also introduces the following automatic metrics: Compression Rate, BLEU, NIST, GTM, WER, PER, TER, and Relations F1. Finally, experiments show the correlation between these automatic evaluation metrics and human judgments.
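The abstract names the automatic metrics but does not give their formulas. As a rough illustration only, the sketch below computes two of them, Compression Rate and WER, under commonly used definitions: compression rate as the fraction of source tokens retained, and WER as token-level edit distance divided by reference length. The function names and the character-level tokenization in the usage note are assumptions made for this example, not details taken from the thesis.

```python
from typing import Sequence


def compression_rate(source: Sequence[str], compressed: Sequence[str]) -> float:
    """Fraction of source tokens kept in the compression (assumed definition)."""
    if not source:
        raise ValueError("source must not be empty")
    return len(compressed) / len(source)


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Token-level Levenshtein distance (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the hyp prefix seen so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hyp token
                            curr[j - 1] + 1,      # add a ref token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def word_error_rate(hyp: Sequence[str], ref: Sequence[str]) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(hyp, ref) / len(ref)
```

For example, with character-level tokens, `compression_rate(list("我今天非常高兴"), list("我高兴"))` compares a 3-token compression against a 7-token source, and `word_error_rate(list("我高兴"), list("我很高兴"))` scores a system output against a human reference compression. Whether the thesis tokenizes by character or by word is not stated in the abstract.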
Keywords/Search Tags: Natural Language Processing, Sentence Compression, Corpus, Annotation Specification, Automatic Evaluation Metrics