
Dai Language Segmentation Based On Dictionary And Statistics

Posted on: 2017-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
GTID: 2348330488965242
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the spread of computers and Internet applications, speech synthesis technology is being applied ever more widely. Text processing is an important part of a speech synthesis system. Like Chinese text, Dai text has no natural word delimiters, so word segmentation is an essential step of text processing in a Dai speech synthesis system, and the segmentation results directly affect the naturalness of the synthesized speech.

Word segmentation methods fall into three categories: understanding-based methods, dictionary-based methods, and statistics-based methods. Understanding-based methods segment text using syntactic, semantic, and other knowledge about the sentence, and are difficult to implement. Dictionary-based methods are highly efficient but cannot identify unknown (out-of-vocabulary) words. Statistical methods identify unknown words well, but their segmentation accuracy is very low. Therefore, to improve the accuracy of Dai word segmentation while retaining good recognition of unknown words, we adopt a method that combines a dictionary with statistics and study it in depth.

The main work includes:

1. This paper introduces the segmentation principles of MMSEG, forward maximum matching (FMM), and conditional random fields (CRF).
2. A corpus was downloaded from the web and cleaned, and a dictionary was then built. Segmentation was first performed with the forward maximum matching algorithm (FMM), which cannot resolve segmentation ambiguity. To compensate, MMSEG-based segmentation was used next; the MMSEG algorithm adds four disambiguation rules to eliminate ambiguity. However, MMSEG cannot identify unknown words, so we propose an MMSEG+CRF segmentation method, which recognizes proper nouns such as personal names and place names well.
3. The experimental results were analyzed, and the three segmentation methods were evaluated using measures including precision and recall.

The results show that MMSEG+CRF achieves high accuracy: precision reached 97.7%, recall reached 95.6%, and F1 reached 96.6%. This meets the requirements of Dai word segmentation, and the synthesized speech has good naturalness.
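As an illustration of the dictionary-based step and the evaluation measures named in the abstract, the following is a minimal sketch of forward maximum matching (FMM) together with word-level precision, recall, and F1. It is not the thesis's implementation: the function names, the toy Latin-letter dictionary, and the single-character fallback for unmatched text are all assumptions made for the example, and the MMSEG rules and the CRF stage are not shown.

```python
from typing import List, Set, Tuple


def fmm_segment(text: str, dictionary: Set[str], max_len: int) -> List[str]:
    """Forward maximum matching: at each position, take the longest dictionary word."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: an unmatched single character is emitted as its own token.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    pos = 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def precision_recall_f1(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Word-level scores: a predicted word is correct if its span also occurs in the gold segmentation."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    if not pred_spans or not gold_spans:
        return 0.0, 0.0, 0.0
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy example (hypothetical data): greedy matching picks "abc" and misses "ab" + "cd".
toy_dictionary = {"ab", "abc", "cd"}
predicted = fmm_segment("abcdcd", toy_dictionary, max_len=3)
gold = ["ab", "cd", "cd"]
print(predicted)                              # ['abc', 'd', 'cd']
print(precision_recall_f1(predicted, gold))
```

The toy input shows the kind of ambiguity that plain FMM cannot resolve: the greedy longest match at the first position forces a wrong boundary later on. As a consistency check on the reported figures, reading the "correct rate" as precision, the harmonic mean of 97.7% and 95.6% is about 96.6%, which agrees with the stated F1.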
Keywords/Search Tags: speech synthesis, Tai language word segmentation, Forward Maximum Matching algorithm (FMM), MMSEG, Conditional Random Fields (CRF)