A discourse-oriented approach to automatic Chinese zero anaphora resolution

Anaphora resolution is a central task in natural language understanding systems. For Chinese, a major challenge is zero anaphora (ZA). Existing automatic Chinese ZA resolution algorithms, despite their differences, rely mainly on syntactic factors. In contrast, linguistic studies show that discourse information such as topic is important in Chinese ZA resolution.

This study addresses this discrepancy by incorporating topic features into the proposed automatic ZA resolution algorithm. The corpus for this study is that of Converse (2006a). A machine-learning approach is adopted.

First, topic structures were annotated in the training data by two graduate students trained in linguistics, both native speakers of Chinese. Each file in the training data was annotated independently and then adjudicated.

Then, the machine-learning algorithm was run in four rounds: (1) Round 1, the baseline, used Zhao & Ng's (2007) 26 features (mainly syntactic information); (2) Round 2 added the manually annotated topics to the baseline as one feature; (3) Round 3 used 21 topic-related features automatically extracted from the corpus; (4) Round 4 used 25 topic-related features re-selected on the basis of a subsequent error analysis.

The performance of the four rounds was compared in three ways: (i) by running 5-fold cross-validation on the training data; (ii) by running the trained model on the test data; and (iii) by conducting ROC analysis. Results show that the use of manually annotated topics (Round 2) and carefully selected topic-related features (Round 4) improves ZA resolution. For example, on the test data, the F-measure of Round 4 was 0.582, 29% higher than the baseline's 0.452, and McNemar's test shows that the error rate of Round 4 was significantly lower than that of the baseline (p < 0.01), with an odds ratio of 3.0.
In addition, Round 4 achieved results similar to those of Round 2. Because the features in Round 4 can be extracted automatically, this approach is cheaper and therefore more practical for ZA resolution than hand annotation.

This study therefore demonstrates that automatic Chinese ZA resolution can be improved beyond previous approaches by including in the model those syntactic features that are highly correlated with the discourse concept of topic.
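
The comparison statistics reported above can be illustrated with a minimal Python sketch. It assumes that the F-measure is the balanced F1 and that McNemar's test is applied in its continuity-corrected chi-square form; the abstract specifies neither. The function names and the discordant-pair counts (60 and 20) are hypothetical, chosen only so that the odds ratio comes out at 3.0, and are not figures from the study.

```python
import math
from typing import Tuple


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


def mcnemar(b: int, c: int) -> Tuple[float, float, float]:
    """McNemar's test with continuity correction (assumed variant).

    b: cases the baseline got wrong and the new model got right
    c: cases the baseline got right and the new model got wrong
    Returns (chi-square statistic, p-value, odds ratio b/c).
    """
    if b + c == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = b / c if c else math.inf
    return chi2, p_value, odds_ratio


# F-measures reported in the abstract: Round 4 vs. the baseline (Round 1).
gain = relative_gain(0.582, 0.452)
print(f"relative F-measure gain: {gain:.1%}")

# Hypothetical discordant counts, NOT taken from the study.
chi2, p_value, odds_ratio = mcnemar(b=60, c=20)
print(f"chi2={chi2:.2f}, p={p_value:.2g}, odds ratio={odds_ratio:.1f}")
```

Only the discordant pairs enter McNemar's test, so the odds ratio here is the ratio of test cases fixed by the new model to test cases it broke.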
