
Quantitative Analysis of a Self-built Multi-domain Chinese Dependency Treebank

Posted on: 2017-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: L L Shi
Full Text: PDF
GTID: 2295330485494752
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
A treebank is a corpus annotated with syntactic information and an important resource for natural language processing. In corpus linguistics and computational linguistics, treebanks can be used to study various grammatical phenomena and the overall features of a language; in computational linguistics they also serve to train and test parsers. At present, most studies are based on a single-domain treebank. Building on the construction of a multi-domain dependency treebank, this thesis studies and analyzes the differences between domains from the perspective of modern Chinese syntax.

The thesis consists of five chapters.

The first chapter is the introduction, covering the research background, object, content, methods and value. The research background includes the current state of treebank resource construction and of treebank-based linguistic studies.

The second chapter describes the construction of the multi-domain dependency treebank. It covers five domains: journalism, micro-blog, colloquial language, medicine and patent. We use the Peking University Multi-view Chinese Treebank (PMT) annotation system, with 26 part-of-speech (POS) tags and 30 dependency relation labels.

The third chapter automatically detects and analyzes errors in the dependency treebank. The quality of a treebank is positively correlated with parsing accuracy, so error analysis plays an important role in improving treebank quality. We carry out a statistical analysis of the errors found in the second and third revisions of the universal treebank. The errors in the second revision are concentrated in three areas: part of speech, grammatical unit and grammatical structure. The third round of detection is based on manually written rules, and the errors mainly fall into three types: word segmentation errors, mismatches between POS and syntactic role, and syntactic role errors.

The fourth chapter presents the quantitative comparison and analysis of the multi-domain dependency treebank. We choose five aspects for studying the differences between the domains: parts of speech, dependency relations, the syntactic functions of parts of speech, the word classes used to fulfill a given syntactic function, and generative rules. In terms of POS, the micro-blog and colloquial domains differ little from each other, as do the medicine and patent domains, while the journalism domain shows characteristics of both groups. In terms of dependency relations, IOB and RED do not occur in the medicine and patent domains. HED and ADV are more frequent in the micro-blog and colloquial domains than in the other three. MT is very frequent in the journalism domain. With regard to the syntactic functions of POS, in the micro-blog and colloquial domains nouns show a stronger tendency to function as objects, pronouns show a stronger tendency to function as subjects, and verbs and adjectives are more often used as predicates. From the perspective of the theory of Correlated Markedness, the behavior of nouns and adjectives differs in every domain and is inconsistent with previous conclusions. The proportion of adjectives functioning as attributes is far lower than we expected. The domains also differ in which POS serve as adverbials: adverbs dominate in the micro-blog and colloquial domains, adverbs and prepositions have equal shares in journalism and medicine, and prepositions predominate in the patent domain. As for generative rules, the same syntactic structure shares common POS sequences across the five domains, but each domain also has its own distinctive sequences. This linguistic knowledge not only tests previous research conclusions but can also make up for deficiencies in existing linguistic theories.

The fifth chapter is the conclusion. We review and summarize the thesis, point out some of its deficiencies and outline prospects for future research.
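The cross-domain comparison of dependency relations described for the fourth chapter can be illustrated with a minimal Python sketch. It is not the thesis's own code: the token layout (form, POS, head index, relation label), the function names and the toy sentences are all assumptions made for illustration, and only HED, ADV, IOB and RED are labels named in the abstract (SBV, VOB and ATT are placeholders). The sketch computes the relative frequency of each relation label per domain and lists the labels that never occur in a given domain.

```python
from collections import Counter


def relation_distribution(sentences):
    """Return {label: relative frequency} over all tokens in `sentences`.

    Each sentence is a list of (form, pos, head, deprel) tuples; this layout
    is an assumption, not the format used in the thesis.
    """
    counts = Counter(deprel for sent in sentences for (_, _, _, deprel) in sent)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def compare_domains(treebanks):
    """Return {label: {domain: relative frequency}} for every observed label."""
    dists = {domain: relation_distribution(s) for domain, s in treebanks.items()}
    labels = sorted({label for dist in dists.values() for label in dist})
    return {
        label: {domain: dists[domain].get(label, 0.0) for domain in dists}
        for label in labels
    }


def absent_labels(table, domain):
    """Labels seen in some domain but never in `domain`."""
    return [label for label, row in table.items() if row[domain] == 0.0]


# Invented toy data, NOT figures from the thesis.
TOY_TREEBANKS = {
    "microblog": [
        [("w1", "r", 2, "SBV"), ("w2", "d", 3, "ADV"), ("w3", "v", 0, "HED")],
        [("w1", "d", 2, "ADV"), ("w2", "v", 0, "HED")],
    ],
    "patent": [
        [("w1", "n", 2, "SBV"), ("w2", "v", 0, "HED"),
         ("w3", "n", 4, "ATT"), ("w4", "n", 2, "VOB")],
    ],
}

if __name__ == "__main__":
    table = compare_domains(TOY_TREEBANKS)
    for label, row in table.items():
        cells = "  ".join(f"{domain}={freq:.2f}" for domain, freq in row.items())
        print(f"{label:<4} {cells}")
    print("absent in patent:", absent_labels(table, "patent"))
```

Relative frequencies are used rather than raw counts so that domains of different sizes remain comparable; the same pattern would apply to POS tags by counting the second tuple field instead of the fourth.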
Keywords/Search Tags:Dependency treebank, multi-domain, error analysis, syntax analysis