Research On Automatic Evaluation Methods Of Mandarin Pronunciation Quality

Posted on: 2015-03-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 1268330422490648
Subject: Computer Science and Technology
Abstract/Summary:
As a core technology of computer-assisted language learning and oral proficiency testing, Automatic Pronunciation Quality Evaluation (APQE) has scientific value, academic significance and a large potential market: it promotes learner flexibility and satisfaction, eliminates the subjectivity and instability of human scoring, reduces costs, and improves timeliness and efficiency. With the fast-growing need to promote Mandarin (Putonghua in Chinese) within China and to export Chinese to the world, APQE technology for Mandarin is a highly anticipated, demanding and competitive field that requires in-depth and systematic research.

Chinese is characterized by monosyllabic tones, with each syllable containing an initial, a final and a tone. In addition, Chinese is rich in syllabic rhymes, has distinct syllable boundaries, and exhibits stress and retroflex suffixation (Erhua in Chinese). The ternary structure and phonological characteristics of the Chinese syllable differ considerably from those of languages in other families, such as English. Dedicated research and new approaches to Mandarin APQE are therefore needed in characterization, modeling, calculation and other aspects. Moreover, state-of-the-art APQE methods are still not adequate for evaluating basic pronunciation units such as initials, finals and tones, especially in fine-grained evaluation tasks for higher-level speakers. Improvements in acoustic modeling and confidence calculation are important for obtaining finer acoustic models and more accurate evaluation models.

The dissertation focuses on enhancing the overall performance of Mandarin APQE methods for native-speaker groups. Several improved APQE methods for Mandarin initials and finals are proposed.
First, because the traditional goodness of pronunciation (GOP) algorithm suffers from low precision, possible boundary errors in phoneme segmentation, and poor discrimination between acoustic models, a method based on the phoneme confusion probability matrix (PCPM) is proposed. The confusion phoneme set (CPS) of each phoneme is constructed from the PCPM. On the one hand, using a restricted recognition network based on the CPS improves the accuracy of phoneme segmentation; on the other hand, using the phoneme confusion prior probabilities and computing the posterior probability (PP) over the CPS improves accuracy and the discrimination between models. Second, in order to extend the range of pronunciations that can be evaluated and to improve the coverage of the acoustic models, a method based on the extended pronunciation space (EPS) is proposed. Mispronunciation samples are used to extend the standard pronunciation space (SPS), so that typical mispronunciations of each phoneme in the SPS are finely modeled and the posterior probability calculated within the EPS is more accurate and effective. Samples containing mispronunciations are easy to obtain, but annotating them in detail is difficult and labor-intensive. An unsupervised learning method is therefore adopted to cluster the mispronunciations, and an automatic model update strategy is designed to continuously improve the accuracy of the evaluation models. Finally, because the above two methods rely on a single-dimensional confidence score and a threshold decision and are therefore not robust enough, an integrated method based on a multi-dimensional confidence vector is proposed. Given a speech segment to be evaluated and its corresponding phoneme, the PPs of all phonemes within the CPS and within the EPS of the given phoneme are calculated separately and then concatenated in order to form a multi-dimensional confidence vector, which serves as the new evaluation feature.
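The construction of the confidence vector described above can be illustrated with a minimal sketch. It assumes that per-phoneme acoustic log-likelihoods for the segment are already available (for example from forced alignment) and that posteriors are obtained by Bayes' rule restricted to a candidate set. The function names, the phoneme labels and all numeric values are hypothetical and are not taken from the dissertation.

```python
import math


def posteriors_over_set(log_likelihoods, priors):
    """Posterior P(q | segment) for each phoneme q in a restricted candidate set.

    log_likelihoods: dict phoneme -> log p(segment | q)   (assumed given)
    priors:          dict phoneme -> prior probability    (e.g. confusion priors)
    """
    # Unnormalized log posterior: log-likelihood plus log prior.
    scores = {q: log_likelihoods[q] + math.log(priors[q]) for q in log_likelihoods}
    # Log-sum-exp normalization for numerical stability.
    top = max(scores.values())
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {q: math.exp(s - log_norm) for q, s in scores.items()}


def confidence_vector(cps_loglik, cps_priors, eps_loglik, eps_priors):
    """Concatenate posteriors over the CPS and over the EPS in a fixed order."""
    cps_pp = posteriors_over_set(cps_loglik, cps_priors)
    eps_pp = posteriors_over_set(eps_loglik, eps_priors)
    # Sorted keys keep each dimension aligned across different segments.
    return [cps_pp[q] for q in sorted(cps_pp)] + [eps_pp[q] for q in sorted(eps_pp)]


# Hypothetical example for the target initial "zh".
cps_loglik = {"zh": -10.0, "z": -12.0, "j": -15.0}
cps_priors = {"zh": 0.80, "z": 0.15, "j": 0.05}
eps_loglik = {"zh": -10.0, "zh_err1": -11.0, "zh_err2": -14.0}
eps_priors = {"zh": 0.70, "zh_err1": 0.20, "zh_err2": 0.10}

vector = confidence_vector(cps_loglik, cps_priors, eps_loglik, eps_priors)
```

The resulting vector, rather than a single thresholded score, would then be passed to a classifier trained on human-rated samples, which is why no threshold has to be chosen by hand.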
When classifiers for the different pronunciation quality levels of each phoneme are constructed, the pronunciation quality of initials and finals can be re-evaluated; the man-machine correlation coefficient is 0.893, which exceeds the average performance of human evaluation.

In the APQE for Mandarin tones, in order to acquire the fundamental frequency (FF) feature effectively and use it at multiple levels, an integrated method based on a multi-dimensional confidence vector is proposed. At the speech frame level, the FF and its first-order and second-order differences are appended to the 39-dimensional Mel Frequency Cepstrum Coefficient (MFCC) features, for a total of 42 dimensions, and embedded tone models of initials and finals are constructed. At the syllable level, a number of statistical FF features of the current syllable and its neighbors, 12 dimensions in total, are extracted, and explicit tone models are constructed using a Gaussian Mixture Model (GMM) as the classifier. Finally, the PPs of the 5 tones calculated with the embedded tone models and the PPs of the 5 tones calculated with the explicit tone models are combined to form a 10-dimensional confidence vector, which serves as the new evaluation feature. When classifiers for the different pronunciation quality levels of each tone are constructed, tone pronunciation quality can be re-evaluated. This method is adaptive and robust: it fuses two complementary modeling methods, combines the FF features of long and short segments, and does not require threshold selection. The results show that the integrated method effectively improves the overall performance of tone evaluation; the man-machine correlation coefficient is 0.899, which exceeds the average performance of human evaluation.

In the APQE for Mandarin Erhua, following the requirements for Erhua evaluation in the National Mandarin Proficiency Test (PSC in China), a classification-based method is proposed.
Based on the phonological rules and acoustic characteristics of Erhua, several representative features, including duration, pronunciation confidence and formants, are extracted, and an improved AdaBoost-based ensemble classifier is employed. In each iteration, the base classifier updates the weight values separately for each category, and a positive term related to the category prior probability and the number of categories is added when calculating the weight value. Compared with the traditional AdaBoost algorithm, these improvements greatly relax the precision requirement on the base classifier and are especially suitable for multi-class classification on imbalanced data sets. The results show that the improved AdaBoost-based classifier outperforms the traditional AdaBoost-based classifier and other classic single classifiers, and that APQE for Erhua can be solved as a typical classification problem. In addition, the method can be easily generalized to the evaluation of other sound variations, such as the neutral (or light) tone and tone sandhi.

In tests on the recorded PSC speech corpus, the overall scores of the experimental system built on the above work are very close to those of human evaluation, with the score difference going from 4.26 to 3.71, which lays a good foundation for developing a practical APQE system for Mandarin.
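To illustrate why adding a positive term to the AdaBoost classifier weight relaxes the accuracy requirement on the base classifier, as described for the Erhua classifier above, the sketch below uses the published SAMME multi-class variant of AdaBoost, which adds ln(K - 1) for K categories. This is a stand-in only: the dissertation's own term also depends on the category prior probability, and its exact form is not given in this abstract. The function names and numeric values are assumptions.

```python
import math


def binary_adaboost_alpha(weighted_error):
    """Classifier weight in traditional AdaBoost: ln((1 - err) / err)."""
    return math.log((1.0 - weighted_error) / weighted_error)


def samme_alpha(weighted_error, num_classes):
    """SAMME classifier weight: the traditional weight plus ln(K - 1).

    The extra positive term keeps alpha > 0 as long as the base classifier
    beats random guessing among K classes (err < 1 - 1/K), not just err < 0.5.
    """
    return binary_adaboost_alpha(weighted_error) + math.log(num_classes - 1)


def update_sample_weights(weights, is_wrong, alpha):
    """Boost the weights of misclassified samples by exp(alpha), then renormalize."""
    boosted = [w * math.exp(alpha) if wrong else w
               for w, wrong in zip(weights, is_wrong)]
    total = sum(boosted)
    return [w / total for w in boosted]


# Hypothetical weak base classifier: 60% weighted error on a 4-class problem.
err, k = 0.6, 4
alpha_traditional = binary_adaboost_alpha(err)  # negative: the learner would be rejected
alpha_samme = samme_alpha(err, k)               # positive: the learner is still useful

new_weights = update_sample_weights([0.25, 0.25, 0.25, 0.25],
                                    [True, False, False, False],
                                    alpha_samme)
```

With four categories, a base classifier with 60% weighted error is still better than chance (75% error), so the added term keeps its weight positive, whereas the traditional formula would assign it a negative weight.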
Keywords/Search Tags: Automatic pronunciation quality evaluation, Mandarin evaluation, Phoneme evaluation, Initials and finals evaluation, Tones evaluation, Erhua evaluation