The Improvement And Implementation Of Mixed Language Model On Japanese Input Method

Posted on: 2013-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: L Chen
Full Text: PDF
GTID: 2268330392469540
Subject: Software engineering
Abstract/Summary:
As computers have developed, they have come to touch every aspect of daily life and work. Information input is an important link in human-computer interaction, so demand for well-developed intelligent input methods keeps growing, and the significance of the Input Method Editor (IME) is increasingly apparent.

With advances in Natural Language Processing, the traditional word-model-based IME has evolved into an intelligent input method based on a language model. Operating at entry granularity, the language model raises conversion accuracy when a whole sentence is converted, which greatly improves pinyin-to-character conversion accuracy. However, a complete language model is too large for an IME to use, so the model must be compressed to fit the application. Common IMEs generally compress the language model by pruning, retaining only core entries. This thesis instead adopts an entry-clustering method, which is made possible by the stricter rules of the Japanese language. Relationships between entries are replaced by relationships between word classes, so that the corpus is better utilized and sparsity within the language model is greatly reduced.

To reduce the information loss during language model compression, the author improves the word-class-based clustering method. A k-means clustering algorithm is proposed that clusters entries according to entry distance and reduces code duplication within the same word class. Entry frequency within a word class is also taken into account, which avoids the information loss caused by merging entries with different frequencies. The loss that remains unavoidable during compression is compensated for with a bigram model. At the same time, the accuracy of the pronunciation model is improved and its coverage is raised accordingly.

Finally, a scalable hybrid language model is established, comprising a 2-pos model, a 2-gram model, and a pronunciation model. By integrating the features of these three models, the new model improves the conversion accuracy of the IME. Comparative tests are run across the different models, and the influence of the hybrid language model on IME conversion accuracy is analyzed.
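
The idea of replacing entry-to-entry relationships with word-class relationships, and then using a word bigram to recover part of the lost information, can be illustrated with a minimal sketch. It is not the thesis's implementation: the class `MixedBigramModel`, the toy corpus, the hand-written word classes, the linear interpolation, and the weight `lam` are all assumptions made for illustration. The class bigram here is the standard factorization P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i).

```python
from collections import Counter


class MixedBigramModel:
    """Toy mix of a word bigram and a class bigram (illustrative only)."""

    def __init__(self, sentences, word_class, lam=0.5):
        self.word_class = word_class   # hypothetical word -> class mapping
        self.lam = lam                 # assumed interpolation weight
        self.word_uni, self.class_uni = Counter(), Counter()
        self.word_hist, self.class_hist = Counter(), Counter()
        self.word_bi, self.class_bi = Counter(), Counter()
        for sent in sentences:
            for w in sent:
                self.word_uni[w] += 1
                self.class_uni[word_class[w]] += 1
            for prev, cur in zip(sent, sent[1:]):
                c_prev, c_cur = word_class[prev], word_class[cur]
                self.word_hist[prev] += 1          # times prev acts as a history
                self.class_hist[c_prev] += 1
                self.word_bi[(prev, cur)] += 1
                self.class_bi[(c_prev, c_cur)] += 1

    def word_prob(self, prev, cur):
        """Maximum-likelihood word bigram P(cur | prev)."""
        if self.word_hist[prev] == 0:
            return 0.0
        return self.word_bi[(prev, cur)] / self.word_hist[prev]

    def class_prob(self, prev, cur):
        """Class bigram P(c_cur | c_prev) * P(cur | c_cur)."""
        c_prev, c_cur = self.word_class[prev], self.word_class[cur]
        if self.class_hist[c_prev] == 0 or self.class_uni[c_cur] == 0:
            return 0.0
        transition = self.class_bi[(c_prev, c_cur)] / self.class_hist[c_prev]
        emission = self.word_uni[cur] / self.class_uni[c_cur]
        return transition * emission

    def mixed_prob(self, prev, cur):
        """Linear interpolation of the two estimates (an assumed combination)."""
        return (self.lam * self.word_prob(prev, cur)
                + (1.0 - self.lam) * self.class_prob(prev, cur))


# Toy romanized corpus and hand-assigned classes, purely hypothetical.
corpus = [["watashi", "wa", "gakusei"], ["kare", "ga", "sensei"]]
classes = {"watashi": "PRON", "kare": "PRON", "wa": "PART",
           "ga": "PART", "gakusei": "NOUN", "sensei": "NOUN"}
model = MixedBigramModel(corpus, classes, lam=0.5)
```

In this toy corpus the pair ("watashi", "ga") never occurs, so the word bigram assigns it zero probability, while the class bigram still gives it mass because a PRON followed by a PART was observed. That is the sparsity reduction the abstract refers to; the word bigram term keeps exact statistics for pairs that were seen.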
Keywords/Search Tags: Japanese IME, language model, clustering algorithm, pronunciation model