
Chinese Bilateral Translation Between Simplified And Traditional-character Texts Based On Conversion Table And Context

Posted on: 2016-09-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Pang
GTID: 2308330476953333
Subject: Computer Science and Technology
Abstract/Summary:
Chinese characters currently in use include both simplified and traditional forms: mainland China and Singapore use simplified characters, while Taiwan, Hong Kong, Macao and part of the overseas Chinese community use traditional characters. As exchange within the Chinese-speaking world increases, these different character sets create many obstacles. Existing technology for simplified-traditional conversion does not perform very well. To address this problem, this thesis proposes a method based on a conversion table and context. The author's earlier work achieved a conversion rate of 95.6% in an accuracy evaluation. Building on that work, the thesis studies in more depth the conversion of one simplified Chinese character to many traditional Chinese characters. This problem can be viewed as a classification problem. The thesis proposes a combination of statistical models plus rules for this conversion problem. The statistical models used are SVM (Support Vector Machine), the Maximum Entropy Model and the Bayes Model. To optimize the classification results, the author first proposes a new text feature selection method called ADMMR. Its results are on par with those of Expected Cross-Entropy and the Chi-Square Test, and experiments show that the features selected by these methods represent the text very well.
With the same classification model, ADMMR outperforms the Information Gain method by about 4%. The thesis also proposes using tf-idf values, instead of 0-1 values, as feature weights in the Maximum Entropy Model, and experiments show that tf-idf outperforms the 0-1 method by about 2%. The author proposes using ADMMR, Expected Cross-Entropy and the Chi-Square Test as feature selection methods, using tf-idf to quantify each feature, and then training SVM and Maximum Entropy models on the training data, which yields 6 classification models. A Bayes learning model is then trained on the same data to give a 7th classification model. The first 6 models vote, and the class with the most votes becomes the predicted category. If two or more classes tie for the most votes, the Bayesian model is used to assist the decision. Experimental results show that the combined model gives better and more stable classification results than an individual SVM, Maximum Entropy or Bayesian model.

The combination of statistical models plus rules is then applied to the simplified-traditional Chinese character conversion problem. The rules convert simplified characters to the corresponding traditional characters according to the thesaurus, and for the 3% of words that cannot form a phrase with other words, the combined model is used. Experiments show that the method achieves a 98.5% accuracy rate, a better solution to the conversion problem.
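The voting step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `combine_votes`, the `bayes_score` callable and the toy scores are hypothetical, and the tie-breaking rule (pick the tied class that the Bayes model scores highest) is an assumption, since the abstract only says the Bayesian model assists when classes tie.

```python
from collections import Counter
from typing import Callable, Sequence


def combine_votes(votes: Sequence[str], bayes_score: Callable[[str], float]) -> str:
    """Majority vote over the model outputs; a Bayes score breaks ties."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    if len(tied) == 1:
        # A single class has the most votes, so it wins outright.
        return tied[0]
    # Assumption: among the tied classes, choose the one the Bayes model
    # scores highest. The abstract does not spell out this rule.
    return max(tied, key=bayes_score)


# Hypothetical example: simplified 发 maps to traditional 發 or 髮.
toy_scores = {"發": 0.4, "髮": 0.6}  # made-up Bayes scores for illustration


def toy_bayes(label: str) -> float:
    return toy_scores[label]


clear_result = combine_votes(["發", "發", "發", "發", "髮", "髮"], toy_bayes)
tie_result = combine_votes(["發", "發", "發", "髮", "髮", "髮"], toy_bayes)
print(clear_result, tie_result)
```

In the first call four of the six votes agree, so the Bayes score is never consulted. In the second call the vote is split 3 to 3 and the toy Bayes score decides.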
Keywords/Search Tags: simplified and traditional characters conversion, simplified and traditional one to many transformations, combination model, maximum entropy, SVM, GIS, ADMMR, feature selection