Annotation And Analysis Of Chinese Financial News Commentaries In Terms Of Rhetorical Structure

Posted on: 2007-07-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Le
GTID: 1115360215977467
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
The revival of the empirical paradigm and the application of machine learning have made the construction of linguistic resources a crucial task in natural language processing. Improvements in character/word and sentence processing, together with the ultimate goal of discourse processing, have made discourse annotation an international research frontier. This dissertation reports my efforts to enrich Chinese language resources by building a Chinese news commentary treebank, using Rhetorical Structure Theory (RST) as its theoretical framework. Following internationally observed methodology in corpus construction, I first conducted a pilot study; then, on the selected corpus, I carried out pre-processing, segmentation, relation annotation, validity checking and an intra-coder agreement test to ensure the quality of the annotated discourses. Driven by the statistics obtained from the finished part of the corpus, I studied various correlations between rhetorical structure and surface linguistic forms. This study can provide a priori scores for automatic Chinese text parsers and summarizers, and data for quantitative linguistic studies.

Specifically, I did the following work:

Before setting out on the detailed tasks of corpus construction, I made a theoretical comparison of the similarities and differences between the English-rooted RST and traditional Chinese linguistic studies of Sentence Complexes (Fuju), Sentence Groups (Juqun), Discourse and Literary Composition (Wenzhang Xue). The evidence shows that the two schools share common ground in their hypotheses about discourse structure and in many specific observations, but RST is more consistent in its communicative perspective on language. It therefore lays more emphasis on the tie between the writer's intentions and the nuclear status of discourse units, insists more on homogeneity among layers of discourse units, and makes more effort toward formalization.
The comparison, together with a review of Chinese RST studies and international RST treebank achievements, established the plausibility and necessity of a large-scale, corpus-based analysis of Chinese texts.

For that purpose, I compiled a Chinese financial news commentary corpus (Caijingpinglun, CJPL) of 400 news texts totaling about 780,000 characters. Made up mainly of financial news reports and commentaries, the CJPL corpus is fairly comparable to the English WSJ-RST treebank, built from Wall Street Journal articles, and to the German PCC treebank, built from Maerkische Allgemeine Zeitung commentary articles. After finishing the pre-processing steps, I first tagged, as an ordinary reader, basic documentary information for every text in CJPL, including Genre, Topic, Title, Lead, Opening, Ending, Source, Author, Publisher, and so on.

Then I carried out a semi-automatic segmentation procedure based on selected EUDA (Elementary Unit of Discourse Analysis) delimiters, namely Full-stop, Question-mark, Exclamation-mark, End-of-paragraph sign, Semicolon, Colon, Ellipsis and Dash. The selection of these delimiters was based on a corpus study of their distribution, which revealed that they not only signal the boundaries of discourse units in the majority of cases but also effectively help reduce the granularity of later discourse analysis. This segmentation procedure yielded indisputable segments, which need only occasional rebinding and no further hand segmentation.

After segmentation, I did a trial annotation of all the inter-EUDA relations in the 400 texts, up to the completion of a discourse tree covering each whole text. By that time I felt I had gained a fair understanding of my texts. I then excluded 2 pieces of questionable integrity and 3 lengthy TV interview transcripts consisting mainly of oral exchanges.

Rooted in the linguistic facts of my corpus, I drafted a Chinese rhetorical relation set of 47 relations, together with their corresponding definitions.
I also drafted an inventory of 19 scheme elements for news texts, and a working manual on how to cope with typical problems in relation tagging. While composing the definitions and the manual, I made constant reference to various rhetorical inventories and traditional Chinese studies. In addition, I conducted a psycholinguistic study of native speakers' preferences for certain structures and relation definitions.

Based on the above-mentioned trial tagging, I annotated the 97 shortest documents of 197 randomly selected from the 395 qualified corpus texts, following the drafted relation definitions and tagging conventions. Each of the 97 documents was annotated twice and, when the whole lot was finished, checked for Tree structure validity. A third round of annotation was done to unify the choices made in the first and second rounds, followed by an intra-coder consistency test and the extraction of data for statistical analysis.

Apart from rhetorical annotation, I also tagged inter-EUDA cue phrases (including inter-EUDA connectives and connectors, inter-EUDA deictic anaphora and pronoun anaphora, as well as discursively functioning orthographical marks).
The tagging was done without reference to the rhetorical annotation.

Data extracted from the completed portion of the CJPL corpus suggest the following points:

1) Following certain principles and conventions, the majority of Chinese news commentaries (93.1%) can be represented by a Tree structure.
2) The rhetorical relations (RRs) defined can be applied recursively to different layers of Chinese discourse units, which demonstrates good homogeneity in Chinese text structure.
3) The Extended relations of the Classic RST set (Mann and Thompson 1988, Mann 2005) cover 90.4% of all cases in the Chinese Financial News Commentary Corpus, and the rest can be covered by deviations of those known RRs.
4) The most common overall Tree structure in CJPL is an opening sentence as Nucleus with a Satellite of multi-nuclear relations (14.4%), followed by an opening sentence as Nucleus with a satellite of mononuclear nucleus (13.4%), an opening sentence as Satellite with a nucleus of mononuclear nucleus (13.4%), and an opening sentence as Satellite with a nucleus of mononuclear satellite and a closing nucleus (11.3%).
5) 53.6% of the root relations of the body of Chinese news commentaries are JUSTIFY, EVALUATION and other presentational relations, suggesting a wide gap between the practical definition of Commentary in the Chinese mass media community and the theoretical definition given by linguists.
6) Despite a high percentage (35.4%) of multi-nuclear relations, hypotactic mononuclear relations still hold the majority.
7) Quite unlike the assumed overwhelming N-S order within Chinese sentence complexes, there is apparently no such dominant order among and above sentences delimited by our selected markers.
8) The different distribution patterns of RRs in the commentary-dominated CJPL corpus and in a report-dominated Chinese TV news report corpus (Xinwenlianbo, XWLB) suggest the influence of genre on RR distribution.
9) About 28.5% of all inter-EUDA relations in CJPL are marked with conjunctions or connectors, the most frequently used being "而(ER)", and the most frequently marked relations being CONJUNCTION-M and LIST-M.
10) Some relations, such as CONJUNCTION-M, CONCESSION-M/N/S and LIST-M, are frequently marked with conjunctions, while others, such as MEANS-S, ATTRIBUTION-S, EVALUATION-M/N, INTERPRETATION-N and SOLUTION-M/N, are rarely marked with conjunctions.
11) Some common conjunctions are used both below and above sentence level in CJPL, but with different distribution patterns: some are clearly much more frequent below sentence level, some much more frequent above sentence level, and some show no significant difference between the two levels.
12) Some inter-EUDA connectors are used consecutively in CJPL texts, and their functions are mainly of three types: mitigating or amplifying modality, restricting each other's rhetorical potentials, and governing different discursive units.
13) The most frequently used inter-EUDA deictic expressions are "这(ZHE)" and words or phrases beginning with "这(ZHE)".
14) Some punctuation marks, such as question-mark, semicolon and colon, correlate strongly with certain RRs and their nuclearity patterns.
15) Despite some strong correlations, there is no one-to-one mapping between discursive cue phrases (connectives, connectors, anaphoric deixis, punctuation marks) and RRs.
16) A characteristic subtree structure of spiral shape has been identified in CJPL trees. In this structure, a discourse unit always relates not to its immediate neighbor but to the most distant unit in the subtree. If this is what Kaplan (1966) meant by the typical circular Chinese discourse structure, and if more cases are found in longer CJPL texts up to a significant level, we could say Kaplan held at least part of the truth.
Although this Chinese RST treebank project has been only partially completed, it promises practical value in discourse studies and cultural studies as well as in Chinese information processing:

First of all, it is the first attempt to build an RST-annotated Chinese discourse treebank. Given other layers of linguistic information in the near future, this corpus can be used to extract the a priori scores needed by Chinese summarizers and as a platform for training and testing statistics-based discourse parsers. This Chinese RST treebank will therefore serve as an ideal testbed for Chinese computer scientists to catch up with their international competitors in discourse processing.

Secondly, our annotation efforts have demonstrated on a fairly large scale the cross-language transferability of RST and its formalization. Meanwhile, new territory for studies of Chinese cue phrases has also been explored. Exciting findings can be expected in Chinese discourse studies after the completion of this treebank. Given comparable corpora in other languages or other genres, this corpus could also be used as an empirical database for contrastive rhetorical studies.

Finally, we can also foresee uses of the annotated Chinese discourse corpus in the social sciences, for instance in corpus-driven studies of journalism or pragmatics.
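
As an illustration only, the delimiter-based segmentation step described in the abstract might look roughly like the following Python sketch. It is not the dissertation's actual tool: the function name segment_eudas, the choice of full-width punctuation characters, and the use of a newline to stand in for the end-of-paragraph sign are all assumptions made for this example.

```python
import re

# Assumed full-width forms of the delimiters named in the abstract:
# full stop, question mark, exclamation mark, semicolon, colon,
# ellipsis (two U+2026 characters) and dash (two U+2014 characters).
# "\n" is an assumed stand-in for the end-of-paragraph sign.
# A run of adjacent delimiters is treated as a single boundary.
DELIMITER_RUN = re.compile(r"(?:……|——|[。？！；：\n])+")


def segment_eudas(text):
    """Split text into candidate units, keeping each delimiter with the unit it closes."""
    units = []
    start = 0
    for match in DELIMITER_RUN.finditer(text):
        unit = text[start:match.end()].strip()
        if unit:
            units.append(unit)
        start = match.end()
    # Keep any trailing text that has no closing delimiter.
    tail = text[start:].strip()
    if tail:
        units.append(tail)
    return units


if __name__ == "__main__":
    sample = "股市上涨；债市下跌。\n原因何在？分析如下：第一……"
    for unit in segment_eudas(sample):
        print(unit)
```

A purely mechanical split like this does not handle cases such as a closing quotation mark that follows a full stop, which is presumably where the occasional manual rebinding mentioned in the abstract would come in.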
Keywords/Search Tags: Natural Language Understanding, Modern Chinese, discourse corpus, Rhetorical Structure Theory, news text, financial news