
Corpus Construction And Research For Hedges Detection In Chinese Wikipedia

Posted on: 2015-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: C Q Zeng
GTID: 2298330467984719
Subject: Computer technology
Abstract/Summary:
Hedges are an important linguistic phenomenon that signals speculation and uncertainty. Information introduced by hedges is called hedge information. Detecting hedge information is the first step of information extraction: it separates hedged information from factual information and so helps extract the facts. A hedge-annotated corpus plays a key role in this detection process. Several English hedge corpora already exist and have advanced English hedge detection, but there is no comparable corpus for Chinese. This thesis therefore designs and constructs a Chinese corpus annotated with hedges and their scopes, based on large-scale real text from Chinese Wikipedia.

The thesis first designs the corpus, then constructs it and analyzes it statistically. The design follows the characteristics of Chinese Wikipedia text and draws on similar English corpora. The balanced structure and the scale of the corpus were planned, and the sampling principle and logical structure were designed in detail. For sampling, the author used the five weasel template tags of Chinese Wikipedia to ensure a sufficient and representative collection of hedged sentences, taking one paragraph as the sampling unit. For the logical structure, the author defined all necessary XML elements and their attributes so that the corpus is organized in XML format.

To reduce the annotation workload, the thesis used dictionary-based maximum matching to produce an initial hedge annotation of the corpus. First, the author collected the Chinese Wikipedia text and built a keyword dictionary of hedges, gathering them from the literature on hedges and expanding the list with a synonym dictionary. Then the author implemented a backward maximum matching program based on the keyword dictionary, in accordance with the features of hedges.
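Dictionary-based backward maximum matching can be illustrated with a minimal Python sketch. This is not the author's program: the function name `backward_maximum_match`, the tiny `HEDGES` dictionary, and the sample sentence are all assumptions chosen for illustration. The idea shown is only the general technique, scanning from the end of the text and always trying the longest dictionary entry first.

```python
def backward_maximum_match(text, hedge_dict, max_len=None):
    """Return (start, end, word) spans of dictionary entries found in text.

    The text is scanned from right to left; at each position the longest
    candidate ending there is tried first (backward maximum matching).
    """
    if max_len is None:
        max_len = max((len(w) for w in hedge_dict), default=0)
    spans = []
    end = len(text)
    while end > 0:
        matched = False
        # Try the longest candidate that ends at `end`, then shorter ones.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if candidate in hedge_dict:
                spans.append((end - size, end, candidate))
                end -= size  # continue to the left of the match
                matched = True
                break
        if not matched:
            end -= 1  # no entry ends here; move one character left
    spans.reverse()  # report spans in reading order
    return spans


# Hypothetical keyword dictionary; the thesis's actual list is not given here.
HEDGES = {"可能", "大概", "据说", "有人认为"}

if __name__ == "__main__":
    print(backward_maximum_match("据说他可能已经离开", HEDGES))
```

Because the matcher works on raw characters, it can also fire on strings that merely contain a dictionary entry across word boundaries, which is the kind of error a later correction step has to catch.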
By analyzing the tagging errors, the author then used a Chinese word segmentation tool to improve the backward maximum matching method and effectively avoid combination-type ambiguity.

To complete the corpus annotation, the thesis carried out manual correction of the initial hedge annotation and manual annotation of hedge scopes. The author proposed specific annotation principles. Hedges were annotated under a "minimum principle" based on the minimal word unit; hedge scopes were annotated under a "maximum principle" based on the maximal possible syntactic unit. Statistics on the annotated corpus show that backward maximum matching based on word segmentation improves annotation accuracy over plain backward maximum matching, and that the corpus largely meets the design goal of a balanced structure.

The results of this thesis can be applied to research on detecting hedges and their scopes, and can help advance research on Chinese text information extraction. They can also serve as an important resource for research on the semantics, grammar, and pragmatics of Chinese hedges. In addition, the thesis can serve as a reference for constructing similar corpora.
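The segmentation-based improvement mentioned in the abstract can be sketched in Python under one plausible reading: matching runs over segmented words rather than raw characters, so a dictionary entry is accepted only when it lines up with word boundaries. The abstract does not describe the actual procedure or name the segmentation tool, so the function `segment_aware_backward_match`, the hand-segmented `tokens` list, and the `HEDGES` entries below are assumptions for illustration only.

```python
def segment_aware_backward_match(tokens, hedge_dict, max_tokens=4):
    """Backward maximum matching over a pre-segmented sentence.

    `tokens` is the output of some word segmenter (supplied by hand here).
    Candidates are built from whole tokens only, so a dictionary entry that
    straddles a word boundary is never matched.
    Returns (start, end, word) spans in character offsets.
    """
    # Character offset at which each token starts, plus the final end offset.
    offsets = [0]
    for tok in tokens:
        offsets.append(offsets[-1] + len(tok))

    spans = []
    end = len(tokens)
    while end > 0:
        matched = False
        # Try the longest run of tokens ending at `end` first.
        for size in range(min(max_tokens, end), 0, -1):
            candidate = "".join(tokens[end - size:end])
            if candidate in hedge_dict:
                spans.append((offsets[end - size], offsets[end], candidate))
                end -= size
                matched = True
                break
        if not matched:
            end -= 1  # no entry ends at this token; move one token left
    spans.reverse()
    return spans


# Hypothetical dictionary and hand-written segmentation of "所有的人可能知道".
HEDGES = {"有的", "可能"}
tokens = ["所有", "的", "人", "可能", "知道"]

if __name__ == "__main__":
    print(segment_aware_backward_match(tokens, HEDGES))
```

In this example a character-level matcher would tag "有的" inside "所有的", whereas the token-level version skips it because "所有" and "的" are separate words and neither is a dictionary entry.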
Keywords/Search Tags:Natural Language Processing, Corpus, Chinese Wikipedia, Hedges