
Research On Automatic Labeling Of Speech Synthesis Corpora

Posted on: 2015-02-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Y Yang
GTID: 1268330428499926
Subject: Signal and Information Processing
Abstract/Summary:
In recent years, speech synthesis technology has advanced considerably in both research and practical applications, and the naturalness and quality of synthesized speech have improved significantly. Today, the mainstream speech synthesis methods are Hidden Markov Model (HMM)-based statistical parametric speech synthesis and large-corpus-based unit selection and waveform concatenation synthesis. Before a system can be built with either method, a speech synthesis corpus must be constructed. There are several ways to obtain the speech data for this purpose. We can design the text material first and then record the speech data. Alternatively, we can use existing speech data (such as video or audiobook databases). However the speech data is obtained, labeling is always necessary for constructing the corpus.

Speech database annotation commonly consists of phonetic labeling and prosodic labeling. Phonetic labeling consists of obtaining the phoneme sequences and performing phonetic segmentation, that is, labeling the start and end time of each phoneme. Generally speaking, the phonetic segmentation is only used in the initialization step of model training, and the performance of automatic phonetic labeling methods is good enough for constructing a speech synthesis system. Prosodic labeling assigns prosodic information to the speech data. The prosodic categories to be labeled vary across languages. For Mandarin speech synthesis corpora, the categories to be labeled are mainly the prosodic boundaries. Because the prosodic labels are used as context information in model training, the accuracy of prosodic labeling affects the naturalness and quality of the synthetic speech. Manual prosodic labels are commonly used in speech synthesis corpora.
However, the amount of manual labeling work increases significantly with the size of the corpus. Several human annotators are therefore needed for the prosodic labeling work, which leads to high labor cost. In addition, because prosodic labels are judged subjectively, it is not easy to maintain consistency among different human annotators. Therefore, how to label corpora precisely and automatically has become an important research direction.

This dissertation focuses on the labeling of speech synthesis corpora. Several methods of automatic prosodic labeling are proposed according to the specific application and corpus style. The main work of the dissertation is as follows:

A prosodic labeling method based on HMM acoustic modeling and state decoding is proposed for speech synthesis corpora. The advantages of this method are as follows. When the acoustic feature distributions are used for prosodic labeling, the method can make full use of the known context information. Because the prosodic labeling results are obtained by decoding the whole sentence, the relations among the prosodic labels at different positions are taken into account. A framework similar to that of speech recognition is adopted, so the model training and decoding algorithms from the field of speech recognition can be used conveniently. In the implementation, we first proposed an exhaustive-search-based method in order to analyze the influence of different features and context features on the labeling results and to verify the feasibility of the proposed method.
Then, a Viterbi-based method is proposed to speed up the labeling procedure while maintaining the labeling performance.

A Deep Neural Network-HMM (DNN-HMM)-based acoustic modeling approach for automatic prosodic labeling is designed and implemented. This method uses the DNN's strong acoustic modeling ability to achieve better automatic prosodic labeling performance.

An unsupervised prosodic labeling method that combines feature-clustering initialization with HMM-based acoustic modeling is proposed. This method can obtain the prosodic labels of speech data without using manual prosodic labels, so personalized speech synthesis systems with multiple speakers and multiple styles can be constructed automatically. We validated the effectiveness of the method in experiments on prosodic phrase boundary labeling of a reading-style corpus and emphasis expression labeling of an audiobook database.

A hidden-emphasis-state-based unsupervised method for emphasis expression labeling and synthesis is proposed. It is motivated by a limitation of our previous work, in which the emphasis expression label is treated as one kind of common context information. When the number of emphatic units is small, the emphasis expression labels have little influence on the decision tree, and it is difficult to train precise emphatic and neutral models. This in turn affects both the labeling performance for emphasis expression and the rendering of emphasis expression in the synthetic speech. To avoid this problem, we separate the emphasis expression labels from the other context information and represent them by linear transformations. In addition, this method describes the emphasis expression labels probabilistically, which can describe the emphasis expression in speech better than the binary representation used in our previous work.
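As a minimal sketch of the difference between the exhaustive-search and Viterbi-based decoding mentioned above, the Python example below labels each word juncture of a sentence as boundary or non-boundary. The two-label set, the names EMIT and TRANS, and all scores are illustrative assumptions, not the dissertation's models or features; the actual method decodes HMM states from acoustic feature distributions.

```python
import itertools

# Hypothetical label set: 0 = no prosodic boundary, 1 = prosodic boundary.
LABELS = (0, 1)

# Made-up log-scores, for illustration only.
# EMIT[t][k]: how well the acoustics at juncture t fit label k.
# TRANS[j][k]: score of label k following label j.
EMIT = [(-0.2, -1.7), (-1.2, -0.4), (-0.3, -1.4), (-0.9, -0.5)]
TRANS = [[-0.4, -1.1], [-0.3, -1.6]]


def sequence_score(labels, emit, trans):
    """Log-score of one complete label sequence."""
    score = emit[0][labels[0]]
    for t in range(1, len(labels)):
        score += trans[labels[t - 1]][labels[t]] + emit[t][labels[t]]
    return score


def exhaustive_decode(emit, trans):
    """Score every label sequence; the number of candidates is 2**T."""
    candidates = itertools.product(LABELS, repeat=len(emit))
    return max(candidates, key=lambda seq: sequence_score(seq, emit, trans))


def viterbi_decode(emit, trans):
    """Dynamic-programming search; work grows linearly with T."""
    best = [emit[0][k] for k in LABELS]  # best score of a path ending in k
    back = []                            # back-pointers for each later step
    for t in range(1, len(emit)):
        pointers, new_best = [], []
        for k in LABELS:
            prev = max(LABELS, key=lambda j: best[j] + trans[j][k])
            pointers.append(prev)
            new_best.append(best[prev] + trans[prev][k] + emit[t][k])
        back.append(pointers)
        best = new_best
    # Trace the best path backwards from the best final label.
    path = [max(LABELS, key=lambda k: best[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return tuple(reversed(path))


if __name__ == "__main__":
    print("exhaustive:", exhaustive_decode(EMIT, TRANS))
    print("viterbi:   ", viterbi_decode(EMIT, TRANS))
```

Both functions maximize the same whole-sentence score, so they should agree on the best score; the Viterbi version avoids enumerating every label sequence, which is where the speed-up comes from.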
Keywords/Search Tags: speech synthesis, hidden Markov model, deep neural network, Viterbi algorithm, prosodic boundaries, emphasis expression