
Research on Speech Conversion Algorithm Based on Generative Adversarial Network

Posted on: 2023-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Tang
GTID: 2568306914460664
Subject: Electronic and communication engineering
Abstract/Summary:
Speech conversion (more commonly called voice conversion) changes the speaker identity of an utterance without changing its linguistic content. Depending on the training corpus, it is divided into parallel and non-parallel speech conversion, where "parallel" means that the source and target speakers utter the same content, sentence for sentence. Because parallel corpora are expensive to collect, speech conversion with non-parallel corpora has become an active research topic. Converted speech is generally evaluated in terms of sound quality and speaker similarity. This thesis uses Mel-cepstral distortion (MCD), mean opinion score (MOS), and ABX blind listening tests for quantitative evaluation, and additionally inspects the waveforms and spectrograms of the generated speech.

A CycleGAN-based speech conversion model serves as the baseline against which the later improvements are compared. The main contribution is to improve this baseline with a SPADE-IN normalization layer and a CAM attention mechanism in order to obtain better sound quality and similarity.

First, applying CycleGAN directly to speech conversion damages part of the structure of the Mel spectrogram along the Mel-frequency dimension. To preserve the spectral details, this thesis incorporates spatially-adaptive normalization (SPADE), proposed by NVIDIA in 2019. SPADE performs well in semantic image synthesis, which indicates that it retains the semantic information of the image well. The original instance normalization (IN) layer is therefore modified with SPADE to counter the possible destruction of the time-frequency structure. Experiments confirm that the baseline damages the original time-frequency structure of the speech, whereas the Mel spectrogram produced by the improved model retains it. Compared with the baseline, the model with the SPADE-IN normalization layer produces converted speech with a clearer Mel spectrogram and more complete spectral details: the average MCD, MOS, and ABX results improve by 2.19%, 3.85%, and 5.56%, respectively. These results show that the method has a measurable positive effect on the sound quality and similarity of the converted speech.

Second, to further improve the sound quality and similarity of the converted speech, similar CAM-based attention mechanisms are added to both the generator and the discriminator. With CAM attention, the generator uses the label information from an auxiliary classifier to learn which regions of the encoded feature map deserve attention, which strengthens the model’s ability to convert toward the target style. The attention map also sharpens the discriminator by making it focus on the differences between real target samples and generated (fake) target samples. As a result, the model learns the characteristics of the target more faithfully.

Finally, the two contributions are combined: the SPADE-IN structure and the CAM attention mechanism are applied to CycleGAN together, which further improves the performance of the model. The resulting spectrograms and time-domain waveforms are better than those of either the SPADE-IN CycleGAN or the CAM-based CycleGAN alone. The combined model also achieves the best results in spectral distortion, mean opinion score, and blind listening tests, a significant improvement over the baseline.
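The MCD figure quoted in the abstract can be illustrated with a minimal sketch of the standard Mel-cepstral distortion formula. The abstract does not give the thesis's exact evaluation setup, so the following are assumptions: the function name `mel_cepstral_distortion` is hypothetical, the two frame sequences are taken to be already time-aligned (for example by dynamic time warping), and the 0th (energy) coefficient is assumed to have been dropped by the caller.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Return the mean MCD in dB over time-aligned frames.

    Each frame is a sequence of Mel-cepstral coefficients. Assumption:
    the frames are already aligned and the energy coefficient is excluded.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("need two non-empty, equally long frame sequences")

    scale = 10.0 / math.log(10.0)  # converts the log-spectral distance to dB
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - c) ** 2 for r, c in zip(ref, conv))
        total += scale * math.sqrt(2.0 * squared_error)
    return total / len(ref_frames)
```

A lower value means the converted speech is spectrally closer to the target, so an "improvement" in MCD corresponds to a decrease.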
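The idea behind a spatially adaptive normalization layer combined with instance normalization can be sketched as follows. This is only an illustration of the general SPADE form (normalize, then scale and shift with position-dependent values), not the thesis's actual layer. Assumptions: the function name `spade_instance_norm` is hypothetical, a single channel of one utterance is represented as a 2D list indexed `[frequency][time]`, and the modulation maps `gamma` and `beta` are passed in directly, whereas in SPADE they are predicted by small convolutional networks from a conditioning map.

```python
import math


def spade_instance_norm(feature, gamma, beta, eps=1e-5):
    """Instance-normalize one channel, then modulate it per position.

    feature, gamma, beta: 2D lists of identical shape [frequency][time].
    Plain instance normalization would use one scalar scale and shift per
    channel; here every time-frequency bin has its own gamma and beta.
    """
    values = [v for row in feature for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    inv_std = 1.0 / math.sqrt(var + eps)

    return [
        [g * (v - mean) * inv_std + b for v, g, b in zip(f_row, g_row, b_row)]
        for f_row, g_row, b_row in zip(feature, gamma, beta)
    ]
```

Because the scale and shift vary across the time-frequency plane, structure that plain normalization would wash out can be reintroduced at each position, which is the motivation the abstract gives for using SPADE to keep spectral detail.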
Keywords/Search Tags: speech conversion, generative adversarial network, SPADE-IN, attention mechanism, non-parallel corpus