Dai Language Segmentation Based On Dictionary And Statistics

Posted on:2017-03-24

Degree:Master

Type:Thesis

Country:China

Candidate:H Li

Full Text:PDF

GTID:2348330488965242

Subject:Electronics and Communications Engineering

Abstract/Summary:

With the spread of computers and Internet applications, speech synthesis technology is being used ever more widely. Text processing is an important part of a speech synthesis system. Like Chinese, written Dai has no natural word delimiters, so word segmentation is an essential step of text processing in a Dai speech synthesis system, and the segmentation results directly affect the naturalness of the synthesized speech.

Word segmentation methods fall into three categories: understanding-based, dictionary-based, and statistics-based. Understanding-based methods segment text using syntactic, semantic, and other sentence-level knowledge, and are difficult to implement. Dictionary-based methods are efficient but cannot identify unknown (out-of-vocabulary) words. Statistics-based methods identify unknown words well, but their segmentation accuracy is very low. Therefore, to improve the accuracy of Dai word segmentation while retaining good recognition of unknown words, we adopt a method that combines statistics with a dictionary and study it in depth.

The main work includes:

1. The segmentation principles of MMSEG, FMM, and the conditional random field (CRF) are introduced.

2. A corpus was downloaded from the Internet and cleaned, and a dictionary was then built. Segmentation was first performed with the forward maximum matching algorithm (FMM), which was found to be unable to resolve ambiguity. To compensate, MMSEG-based segmentation was used; MMSEG adds four disambiguation rules to eliminate ambiguity. However, MMSEG cannot identify unknown words, so we propose a segmentation method based on MMSEG+CRF, which recognizes some proper nouns, personal names, and place names well.

3. The experimental results were analyzed, and the three segmentation methods were evaluated in terms of precision and recall. The results show that MMSEG+CRF achieves high accuracy: precision reached 97.7%, recall reached 95.6%, and F1 reached 96.6%. It meets the requirements of Dai word segmentation, and the synthesized speech has good naturalness.
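The abstract names forward maximum matching and evaluation by precision, recall, and F1 without showing either. The following is a minimal Python sketch of both, not the thesis implementation: the function names, the toy dictionary, and the Latin-letter strings standing in for Dai text are all assumptions made for illustration. It also shows the kind of ambiguity error that greedy longest-match segmentation can make.

```python
def fmm_segment(text, dictionary, max_len=None):
    """Forward maximum matching: at each position, take the longest
    dictionary word; fall back to a single character if none matches."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, gold):
    """Word-level precision, recall, and F1 of a predicted segmentation
    against a gold segmentation of the same text."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy data (hypothetical): Latin letters stand in for Dai characters.
toy_dictionary = {"ab", "abc", "cd", "de", "fg"}
predicted = fmm_segment("abcdefg", toy_dictionary)
gold = ["ab", "cd", "e", "fg"]  # assumed reference segmentation
precision, recall, f1 = evaluate(predicted, gold)

# F1 implied by the precision and recall reported in the abstract.
reported_f1 = f_measure(0.977, 0.956)
```

On the toy input, the greedy match takes "abc" first and so misses the assumed reference split "ab" / "cd", which is the ambiguity problem the abstract attributes to FMM. Applying `f_measure` to the reported precision (97.7%) and recall (95.6%) gives roughly 0.966, consistent with the reported F1.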

Keywords/Search Tags:

speech synthesis, Dai language word segmentation, Forward Maximum Matching algorithm (FMM), MMSEG, Conditional Random Fields (CRF)

Related items

1	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
2	Research Of Chinese Word Segmentation With Conditional Random Fields
3	Research Of Named Entity Recognition Based On Conditional Random Fields
4	The Research Of Applying Conditional Random Fields To Chinese Word Segmentation And Part-Of-Speech Tagging
5	Study On The Tibetan Word Segmentation And Named Entity Recognition With Conditional Random Fields
6	Research And System Implementation Of Chinese Word Segmentation In Specialized Fields Based On Conditional Random Fields
7	Study Of Automatic Segmentation Technique Based On Conditional Random Fields
8	Application Of Conditional Random Fields In Mongolian Word Segmentation
9	Research And Implementation Of Chinese Segmentation System Based On Conditional Random Fields Model
10	The Effect Of Part Of Speech On Chinese Word Segmentation