Research On Automatic Construction Of Speech Corpus And Speech Minimized Labeling

Posted on:2014-04-05

Degree:Master

Type:Thesis

Country:China

Candidate:Z N Zhang

Full Text:PDF

GTID:2268330401984133

Subject:Computer application technology

Abstract/Summary:

Although current state-of-the-art text-to-speech systems can produce intelligible voices, the synthesized utterances lack the rich prosodic features of the original speech. This is because prosody models built from single-sentence recordings, such as CMU ARCTIC, yield synthesized speech of poor intelligibility and naturalness. A speech corpus that captures rich prosody and context information is therefore a prerequisite for synthesizing highly intelligible and natural utterances. However, developing such a corpus requires a great deal of time and effort. An alternative is to exploit the multi-paragraph monologues in audiobooks and other speech resources, such as broadcasts, to build large speech databases automatically. These monologues already capture rich prosody, including varied intonation contours, pitch accents and phrasing patterns. Nonetheless, processing such audiobooks poses several challenges, including the segmentation of long speech files and the automatic extraction of multi-paragraph speech files according to the corresponding transcriptions.

Hence, an approach that extracts the trainable parts of the original speech files according to the corresponding texts, arranged in paragraphs, would greatly shorten the time needed to build the corpus. In addition, a technique is needed that segments the long speech files obtained in this way into isolated sentences according to the corresponding transcripts. Together, these would not only reduce the cost of manually labeling a large speech corpus but also, ultimately, improve the intelligibility and naturalness of the synthesized voices. We therefore address automatic speech-text alignment and high-accuracy automatic segmentation of long speech files.

This paper is organized as follows.

First, we propose an approach that breaks long speech files into single utterances while retaining the original prosodic features and context information. The method integrates the zero-labeling sentence segmentation method based on forced alignment, proposed earlier, with a minimum-labeling classification method based on co-training. The former generates an initial, precise labeled set, which serves as input to the classification system; the classifier then identifies sentence boundaries by semi-supervised learning. The whole procedure is an iterative mechanism built on a shared time axis. Experiments show that the precision of segmenting long speech files can be raised to 96.2%. The original speech files are then segmented into smaller chunks at the detected break points.

Second, we propose an approach that automatically aligns speech files with their transcriptions. The approach is based on Google Voice technology. Using forced alignment and a minimum-word criterion, the trainable speech and the corresponding text sequences are extracted from the original files. An iterative mechanism is also proposed to maximize the extraction rate.

Finally, a triphone-based isolated-word recognition system is added to evaluate the performance of the proposed speech-text alignment technique and the method for automatically segmenting long speech files. Our experiments demonstrate that these methods can be used for the automatic construction of speech corpora in the near future.
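The abstract does not spell out how the minimum-word criterion works, so the following is only a minimal sketch of one plausible reading: keep a stretch of audio as trainable only when the recognizer output and the transcript agree on at least a minimum number of consecutive words. The function name `extract_trainable_spans`, the parameter `min_words`, and the sample sentences are hypothetical. Python's `difflib.SequenceMatcher` stands in for the alignment step, and the recognizer output is a hard-coded word list rather than a call to any speech service.

```python
from difflib import SequenceMatcher


def extract_trainable_spans(transcript_words, recognized_words, min_words=3):
    """Return (transcript_start, recognized_start, length) for every run of
    consecutive matching words that is at least min_words long.

    Assumption: this stands in for the thesis's minimum-word criterion,
    which the abstract names but does not define.
    """
    matcher = SequenceMatcher(a=transcript_words, b=recognized_words,
                              autojunk=False)
    spans = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length terminator; the size filter drops it.
        if block.size >= min_words:
            spans.append((block.a, block.b, block.size))
    return spans


# Hypothetical data: the reference transcript and an imperfect recognition result.
transcript = "the quick brown fox jumps over the lazy dog".split()
recognized = "uh the quick brown fox jumped over the lazy dog".split()

spans = extract_trainable_spans(transcript, recognized, min_words=3)
for t_start, r_start, length in spans:
    print(" ".join(transcript[t_start:t_start + length]))
```

Raising `min_words` trades extraction rate for reliability: longer agreed runs are less likely to be accidental matches, but fewer regions survive. An iterative scheme like the one the abstract mentions could rerun recognition on the rejected regions and apply the same filter again.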

Keywords/Search Tags:

Speech-Text System, Sentence Segmentation, Voice Synthesis, Forced Alignment

Related items

1	Research On Automatic Speech-Text Alignment For Mongolian Long Audio
2	Technology Research On Chinese English Text Level Sentence Alignment
3	Research On Unannotated Long Chinese Speech Text-speech Alignment
4	Study On Automatic Construction Of Speech Database~2
5	Research Of Embedded Speech Synthesis Technology
6	Research And Realization Of Embedded Speech Synthesis System
7	Text-Speech Alignment Based On General Speech Recognition
8	Research Of Long Speech And Text Alignment
9	Research Of Mandarin Text-Speech Alignment Based On SailAlign
10	Text Analysis Of Burmese Language For Speech Synthesis