
Research On Wikipedia-based Chinese Cross-document Co-reference Resolution

Posted on: 2015-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: X M Xu
Full Text: PDF
GTID: 2298330428498409
Subject: Software engineering
Abstract/Summary:
As a significant component of information extraction and information fusion, research on Cross-Document Co-reference Resolution (CDCR) has received extensive attention. The main task of entity cross-document co-reference resolution is to solve the problems of name variation and name ambiguity. The former means that one entity has multiple names, while the latter means that multiple entities share the same name. The lack of large-scale Chinese cross-document co-reference corpora impedes research in this area. Therefore, this paper conducts the following research:

1. Construct a Chinese cross-document co-reference corpus based on Wikipedia, and analyze co-reference phenomena on the corpus, laying the foundation for large-scale CDCR. Statistics show that, compared with the news domain, the problem of name variation is much more severe than ambiguity for Wikipedia entities.

2. Implement a CDCR system using the vector space model (VSM) and unsupervised clustering. Experimental results show that the similarity score between mentions has a greater effect on system performance than that between space vectors.

3. Investigate CDCR techniques on a large-scale CDCR corpus. A "divide and conquer" strategy is adopted to partition all mentions of a given entity type into mutually exclusive blocks, with every block clustered independently, mitigating the time and space complexity that CDCR incurs on a large-scale corpus. Experiments show that the overall CDCR time is significantly reduced while performance remains at a reasonable level.
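The abstract does not specify the features, similarity measures, clustering algorithm, or blocking criterion used in the thesis. The following is only a minimal sketch of the general idea behind items 2 and 3, under stated assumptions: each mention carries a pre-segmented list of context tokens, the VSM uses raw term frequencies with cosine similarity, clustering is a greedy single-link pass with a fixed threshold, and the blocking key is supplied by the caller. All function names, field names (`id`, `name`, `tokens`), the threshold value, and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def vectorize(tokens):
    """Build a term-frequency vector from pre-segmented context tokens (assumed input)."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_block(mentions, threshold):
    """Greedy single-link clustering (an assumed stand-in for the thesis's method).

    A mention joins the first existing cluster that contains a member whose
    cosine similarity to it reaches the threshold; otherwise it starts a new one.
    """
    clusters = []
    for m in mentions:
        vec = vectorize(m["tokens"])
        for c in clusters:
            if any(cosine(vec, vectorize(o["tokens"])) >= threshold for o in c):
                c.append(m)
                break
        else:
            clusters.append([m])
    return clusters


def resolve(mentions, block_key, threshold=0.3):
    """Divide and conquer: partition mentions into exclusive blocks, cluster each alone."""
    blocks = defaultdict(list)
    for m in mentions:
        blocks[block_key(m)].append(m)
    result = []
    for block in blocks.values():
        # Only mentions inside the same block are ever compared,
        # which is what reduces the pairwise-comparison cost.
        result.extend(cluster_block(block, threshold))
    return result


# Hypothetical sample data: three mentions named "Li Na", one named "Wang".
mentions = [
    {"id": 1, "name": "Li Na", "tokens": ["tennis", "grand", "slam", "champion"]},
    {"id": 2, "name": "Li Na", "tokens": ["tennis", "champion", "wuhan"]},
    {"id": 3, "name": "Li Na", "tokens": ["singer", "album", "folk"]},
    {"id": 4, "name": "Wang", "tokens": ["tennis", "champion"]},
]

clusters = resolve(mentions, block_key=lambda m: m["name"])
cluster_ids = [[m["id"] for m in c] for c in clusters]
print(cluster_ids)
```

Blocking by the exact name string, as in this sample call, only addresses ambiguity (one name, several entities); it can never merge name variants of the same entity, because those fall into different blocks. A real blocking key would have to group variant names together, and how the thesis does this is not stated in the abstract.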
Keywords/Search Tags: CDCR, VSM, Clustering, large-scale corpora, Divide and Conquer