
Deciphering Unknown Chinese Characters Based On Corpus Feature

Posted on: 2016-05-31; Degree: Master; Type: Thesis
Country: China; Candidate: Q Zhao; Full Text: PDF
GTID: 2348330503494263; Subject: Computer technology
Abstract/Summary:
Substitution decipherment has long been widely used in natural language processing. In particular, it has performed well in machine translation and in the decipherment of unknown languages. However, most of the effective work is based on corpora of alphabetic languages rather than non-alphabetic ones. In this study, we explore the possibility of applying substitution decipherment to corpora of non-alphabetic languages, especially logographic ones. Because bilingual corpora are scarce for most unknown languages, we perform the decipherment on non-bilingual corpora.
The premise of our experiment is a set of known text as training data and a set of text containing some unknown characters as test data. Both the training data and the test data belong to the same writing system. We used text of over two hundred thousand Chinese characters. We first introduced decipherment based on a language model. To address the resulting risk of combinatorial explosion, we added beam search to keep the computation tractable. We adopted word representation to improve the accuracy of the results. In addition, a stroke feature was added as a hyper-parameter to improve the efficiency of the decipherment.
The approaches in this study are suitable for deciphering unknown characters in ancient Chinese text, such as the Oracle. Manually identifying the unknown characters in such texts is a tremendously complicated and burdensome job, because it demands extensive experience and specialized knowledge. Decades of hard work may be consumed if there is no related bilingual corpus. With the help of our work, the set of candidates for the unknown characters could be reduced from the previous universal set of characters to dozens, which would greatly reduce the manual work.
Furthermore, because our method exploits context, the possibility of identifying synonyms of the unknown characters is remarkably increased, so the entire text may be better understood even when the exact match is not found. Nevertheless, data sparsity still presents difficulties for our study. To address the problem, we adopted beam search. We hope to work out better solutions for it.
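The combination of language-model decipherment and beam search described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the character bigram model, the add-one smoothing, the `?` marker for unknown characters, and the names `train_bigram`, `log_prob`, and `decipher` are all assumptions made for illustration, and the word representation and stroke feature are left out.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character unigrams and bigrams over known sentences (assumed model)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        chars = ["<s>"] + list(sent) + ["</s>"]
        unigrams.update(chars[:-1])                 # context counts
        bigrams.update(zip(chars[:-1], chars[1:]))  # adjacent pairs
    vocab = {c for sent in corpus for c in sent}
    return unigrams, bigrams, vocab


def log_prob(prev, cur, unigrams, bigrams, vocab_size):
    """Add-one smoothed log P(cur | prev)."""
    return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))


def decipher(text, model, unknown="?", beam_width=3):
    """Fill each unknown position, keeping only the best `beam_width` hypotheses."""
    unigrams, bigrams, vocab = model
    v = len(vocab) + 1  # known characters plus the end marker
    candidates = sorted(vocab)
    beam = [(0.0, "")]  # (log-probability, deciphered prefix)
    for ch in text:
        options = candidates if ch == unknown else [ch]
        expanded = []
        for score, prefix in beam:
            prev = prefix[-1] if prefix else "<s>"
            for opt in options:
                step = log_prob(prev, opt, unigrams, bigrams, v)
                expanded.append((score + step, prefix + opt))
        # Pruning here is what avoids enumerating every combination.
        expanded.sort(key=lambda item: (-item[0], item[1]))
        beam = expanded[:beam_width]
    # Score the sentence end so that complete hypotheses are comparable.
    final = []
    for score, prefix in beam:
        prev = prefix[-1] if prefix else "<s>"
        final.append((score + log_prob(prev, "</s>", unigrams, bigrams, v), prefix))
    final.sort(key=lambda item: (-item[0], item[1]))
    return final


# Toy usage: Latin letters stand in for Chinese characters.
model = train_bigram(["abc", "abc", "abd"])
ranked = decipher("a?c", model)
```

Without pruning, a text with k unknown positions and a vocabulary of V characters has V^k possible fillings; the beam keeps at most `beam_width` of them at each step, at the cost of possibly discarding the best one. The returned list is a ranked shortlist of candidates rather than a single answer, in the spirit of the candidate reduction the abstract describes.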
Keywords/Search Tags: beam search, stroke feature, word embedding representation