Research On Cross-lingual Word Similarity Computation

Posted on: 2012-11-11
Degree: Master
Type: Thesis
Country: China
Candidate: T T Zhao
Full Text: PDF
GTID: 2218330368991826
Subject: Computer application technology
Abstract/Summary:
Cross-lingual word similarity (CLWS) reflects the semantic similarity between two words in different languages and is a basic component of cross-lingual information access systems. CLWS research has only recently begun to attract attention, as the volume of multilingual content on the Internet has turned out to be surprisingly large. This paper focuses on measuring the semantic similarity between Chinese and English words. We find that HowNet is a promising knowledge base for CLWS measurement and that a parallel corpus is useful for fine-tuning CLWS with cross-lingual co-occurrence statistics.

We adopt HowNet as the cross-lingual knowledge base. The HowNet-based CLWS measure is similar to those used in monolingual settings: it is based on the concept definitions of words and on the hierarchical structure of sememes in HowNet. Once the concept definition of a word is obtained, the semantic similarity between sememes, and in turn between words, can be measured. The experimental results indicate that HowNet is a promising knowledge base for CLWS measurement, and that definition failures in HowNet strongly affect the performance of the measure.

To improve the accuracy of the CLWS measure, we use a parallel corpus as development data. We first extract co-occurring words and rank them by pointwise mutual information (PMI), and then present several measures for computing the similarity between the co-occurring words. Finally, we combine the results of the HowNet-based measure with those of the corpus-based measure, which improves the performance of the CLWS measure. The experimental results also indicate that increasing the corpus size improves performance further.

CLWS measures are evaluated by comparing the computed similarity scores with human-judged similarity, but no benchmark cross-lingual dataset is available for Chinese-English CLWS evaluation. We therefore invited language experts to extend the Miller-Charles (MC) benchmark dataset, which is commonly used to evaluate English word similarity measures. The resulting extended MC dataset contains 28 Chinese-English word pairs and serves as a benchmark dataset for Chinese-English CLWS research.
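
The corpus-based step mentioned in the abstract, ranking cross-lingual co-occurrence words by PMI, can be illustrated with a minimal sketch. The abstract does not say how co-occurrence is counted, so this example assumes sentence-level co-occurrence over pre-tokenized, sentence-aligned Chinese-English pairs. The function name `rank_cooccurrences_by_pmi`, the `min_count` parameter, and the toy corpus are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def rank_cooccurrences_by_pmi(sentence_pairs, min_count=1):
    """Rank English words co-occurring with each Chinese word by PMI.

    sentence_pairs: list of (zh_tokens, en_tokens) for aligned sentences.
    Assumption: a Chinese word and an English word "co-occur" when they
    appear in the same aligned sentence pair.
    Returns {zh_word: [(en_word, pmi), ...]} sorted by descending PMI.
    """
    n = len(sentence_pairs)
    zh_counts, en_counts, pair_counts = Counter(), Counter(), Counter()

    for zh_tokens, en_tokens in sentence_pairs:
        # Use sets so each word counts at most once per sentence pair.
        zh_set, en_set = set(zh_tokens), set(en_tokens)
        zh_counts.update(zh_set)
        en_counts.update(en_set)
        pair_counts.update((z, e) for z in zh_set for e in en_set)

    ranked = {}
    for (z, e), c in pair_counts.items():
        if c < min_count:
            continue
        # PMI = log2( p(z, e) / (p(z) * p(e)) ), with probabilities
        # estimated as relative frequencies over the n sentence pairs.
        pmi = math.log2(c * n / (zh_counts[z] * en_counts[e]))
        ranked.setdefault(z, []).append((e, pmi))

    for z in ranked:
        # Highest PMI first; ties broken alphabetically for determinism.
        ranked[z].sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Toy, hand-made corpus for illustration only (not data from the thesis).
toy_pairs = [
    (["猫", "喝", "牛奶"], ["cat", "drinks", "milk"]),
    (["猫", "睡觉"], ["cat", "sleeps"]),
    (["狗", "喝", "水"], ["dog", "drinks", "water"]),
]
toy_ranking = rank_cooccurrences_by_pmi(toy_pairs)
```

On a corpus this small, many candidates tie, which is consistent with the abstract's observation that a larger corpus improves the measure. In practice a `min_count` threshold above 1 is a common way to suppress unreliable PMI values for rare pairs.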
Keywords/Search Tags: Cross-lingual word similarity, cross-lingual information access, HowNet, parallel corpus