Deciphering Unknown Chinese Characters Based On Corpus Feature

Posted on:2016-05-31

Degree:Master

Type:Thesis

Country:China

Candidate:Q Zhao

Full Text:PDF

GTID:2348330503494263

Subject:Computer technology

Abstract/Summary:

Substitution decipherment has long been widely used in natural language processing, and it has achieved excellent performance in machine translation and in the decipherment of unknown languages. However, most effective work has relied on corpora of alphabetic languages rather than non-alphabetic ones. In this study, we explore the possibility of applying substitution decipherment to corpora of non-alphabetic languages, especially logographic ones. Because bilingual corpora are scarce for most unknown languages, we perform the decipherment on non-bilingual corpora.

Our experimental setup assumes a set of known text as training data and a set of text containing some unknown characters as test data; both belong to the same writing system. We used text of more than two hundred thousand Chinese characters. We first introduced decipherment based on a language model. To address the resulting risk of combinatorial explosion, we added beam search to keep the computation tractable. We also adopted word representations to improve the accuracy of the results. In addition, a stroke feature was added as a hyper-parameter to improve the efficiency of the decipherment.

The approaches in this study are suitable for deciphering unknown characters in ancient Chinese texts, such as oracle bone inscriptions. Manually identifying unknown characters in such texts is tremendously complicated and burdensome, because it demands extensive experience and specialized knowledge. Decades of hard work may be consumed if no related bilingual corpus exists. With the help of our work, the set of candidates for the unknown characters could be reduced from the previous universal set to a few dozen, which would greatly reduce the manual work. Furthermore, our exploration of the context markedly improves the chance of finding synonyms of the unknown characters, so the entire text may become easier to understand even when the exact match eludes us. Nevertheless, data sparsity still presents difficulties for our study. To address the problem, we adopted beam search. We hope to work out better solutions for it.
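
The combination of a language model with beam search described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the character bigram model, the add-one smoothing, the `?` placeholder for an unknown character, and the function names `train_bigram`, `log_prob`, and `beam_decipher` are all assumptions made for illustration, and the word representation and stroke feature are left out.

```python
import math
from collections import Counter


def train_bigram(text):
    """Count character unigrams and bigrams in the known training text."""
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    return unigrams, bigrams


def log_prob(prev, cur, unigrams, bigrams):
    """Add-one smoothed log P(cur | prev); log P(cur) when prev is None."""
    vocab_size = len(unigrams)
    if prev is None:
        total = sum(unigrams.values())
        return math.log((unigrams[cur] + 1) / (total + vocab_size))
    return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))


def beam_decipher(test, unigrams, bigrams, unknown="?", beam_width=5):
    """Fill each unknown position with a known character.

    Only the beam_width best partial hypotheses survive each step, which
    avoids enumerating every combination of candidate characters.
    """
    vocab = sorted(unigrams)          # candidate characters (assumed: all known ones)
    beam = [("", 0.0)]                # (partial text, cumulative log-probability)
    for ch in test:
        candidates = vocab if ch == unknown else [ch]
        expanded = []
        for prefix, score in beam:
            prev = prefix[-1] if prefix else None
            for cand in candidates:
                step = log_prob(prev, cand, unigrams, bigrams)
                expanded.append((prefix + cand, score + step))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_width]  # prune to the beam width
    return beam
```

For example, after `uni, bi = train_bigram("abcabcabcabc")`, calling `beam_decipher("a?c", uni, bi, beam_width=2)` returns a ranked list of at most two hypotheses with their log-probabilities. The ranked list, rather than a single answer, mirrors the abstract's goal of narrowing each unknown character down to a short candidate set for manual review.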

Keywords/Search Tags:

beam search, stroke feature, word embedding representation

Related items

1	Research On The Representation Of Word Embedding Based On Knowledge Fusion
2	A Study On Improving Multi-prototype Word Embedding
3	Research On Construction Method Of Entity Semantic Vector In Science And Technology Field
4	Research And Application On Word Embedding Of Low Frequency Words
5	Sentence Embedding Representation With Syntactic Information Learning Method And Application Research
6	Dynamic Weighting Of Word Embedding And Distributed Learning Strategies
7	Research On Vector Representation Of Text Sentiment Analysis Based On Word Vectors
8	Feature Representation And Neighbor Embedding Based Image Super Resolution
9	User Feature Recognition Based On Spatio-temporal Word Embedding Of Trajectory
10	Research And Improvement On Text Classification Based On Word Embedding