Making computers capable of emotional voice communication has always been a difficult and active research topic in the field of human-computer interaction. In human communication, the speech signal carries rich textual information as well as the speaker's emotional characteristics. Currently, performing emotional transformation on machine-synthesized speech is a particularly important and intuitive way to obtain an emotional machine voice. Emotional speech conversion is a technology that converts neutral speech into emotional speech, and it is widely used in emotion recognition, medical equipment, communication services, and so on. This thesis focuses on emotion features and emotional speech conversion models. The main research contents are as follows.

Firstly, a speech reconstruction method using an L1/2 sparse constraint on Mel-frequency cepstral coefficients (MFCCs) is proposed. For high-quality speech reconstruction, it is usually necessary to consider several types of acoustic feature parameters in the model. For example, the formant model needs the formant parameters and the fundamental frequency, while the MELP model requires the fundamental frequency, sub-band speech intensity, voiced/unvoiced marks, and residual parameters such as the difference peak and frame energy. In theory, the more feature parameters the model uses, the better the naturalness and intelligibility of the reconstructed speech. However, the computational cost grows as more features are considered, and the quality of the reconstructed speech is strongly affected by how well the individual feature parameters are estimated, so the selection of parameters is crucial. Estimating the speech amplitude spectrum from MFCCs, in turn, is an underdetermined problem. To tackle it, this thesis proposes an L1/2-constrained algorithm that estimates the speech amplitude spectrum from the Mel cepstral coefficients, estimates the phase spectrum from the resulting sparse amplitude spectrum, and finally reconstructs the time-domain speech signal from the estimated spectra. This method not only shows that the L1/2 sparse constraint offers good inverse-reconstruction performance in speech conversion, but also that MFCC features model the auditory characteristics of the human ear well.
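The sketch below illustrates this kind of pipeline. It is a minimal sketch, assuming librosa's MFCC conventions (an orthonormal type-II DCT of dB-scaled Mel-band energies) and the iterative half-thresholding solver for L1/2 regularization due to Xu et al. (2012); the function names, regularization weight, and signal are illustrative assumptions, not the thesis's exact implementation.

```python
# Minimal sketch: recover an amplitude spectrum from MFCCs under an L1/2
# sparsity constraint, then estimate phase with Griffin-Lim.
import numpy as np
import librosa
from scipy.fft import idct

def half_threshold(b, lam):
    # Componentwise half-thresholding operator for the L1/2 penalty
    # (Xu et al., 2012): small entries are zeroed, the rest shrunk.
    out = np.zeros_like(b)
    thr = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    mask = np.abs(b) > thr
    phi = np.arccos(np.clip((lam / 8.0) * (np.abs(b[mask]) / 3.0) ** -1.5,
                            -1.0, 1.0))
    out[mask] = (2.0 / 3.0) * b[mask] * (1.0 + np.cos(2.0 * np.pi / 3.0
                                                      - 2.0 * phi / 3.0))
    return out

def power_spectrum_from_mfcc(mfcc, mel_basis, lam=1e-3, n_iter=200):
    # Step 1: undo the truncated DCT and the dB scaling to recover
    # approximate Mel-band energies m.
    n_mels = mel_basis.shape[0]
    m = librosa.db_to_power(idct(mfcc, type=2, n=n_mels, axis=0, norm="ortho"))
    # Step 2: mel_basis @ p = m is underdetermined (few bands, many bins);
    # solve for p with an L1/2 constraint via iterative half-thresholding.
    mu = 0.99 / np.linalg.norm(mel_basis, 2) ** 2  # step size for convergence
    p = mel_basis.T @ m                            # rough initial estimate
    for _ in range(n_iter):
        p = half_threshold(p + mu * (mel_basis.T @ (m - mel_basis @ p)),
                           lam * mu)
        p = np.maximum(p, 0.0)                     # power is non-negative
    return p

sr, n_fft, hop, n_mels = 16000, 1024, 256, 40
y = librosa.chirp(fmin=100, fmax=4000, sr=sr, duration=2.0)  # stand-in signal
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft,
                            hop_length=hop, n_mels=n_mels)
mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
p_hat = power_spectrum_from_mfcc(mfcc, mel_basis)
y_hat = librosa.griffinlim(np.sqrt(p_hat), n_iter=32, hop_length=hop)  # phase
```

For comparison, librosa's built-in mfcc_to_audio resolves the same underdetermined Mel-inversion step with non-negative least squares; the L1/2 penalty instead biases the solver toward sparse spectra, which is the property this contribution exploits.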
Secondly, this thesis proposes an emotional speech conversion method based on a bidirectional long short-term memory (BLSTM) network. Traditional speech emotion conversion methods mainly include neural networks (NNs), Gaussian mixture models (GMMs), non-negative matrix factorization (NMF), and their variants. Since the GMM and NMF are linear representation models, they are only suitable for linear or simple piecewise-linear relationships between features. In addition, because its transfer function is composed of local regression functions and the model uses multiple Gaussian kernels, the GMM is prone to over-fitting. Unlike the GMM, the transformation rules learned by a neural network are nonlinear, so its conversion quality is usually better than that of the GMM. However, a conventional neural network treats each input speech frame as an independent feature and cannot describe the temporal correlation across frames of a speech sequence. Therefore, this thesis proposes a BLSTM network to map neutral speech features to emotional speech features, and then reconstructs sad, angry, and happy speech from the converted acoustic features with the L1/2 sparse constraint method. The experimental results show that the emotional speech obtained by the proposed method has better naturalness.
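A minimal sketch of the mapping stage, written in PyTorch, is shown below. The feature dimension, layer sizes, and MSE loss over time-aligned parallel utterances are illustrative assumptions, not the thesis's reported configuration.

```python
# Minimal sketch: a BLSTM that maps neutral acoustic feature sequences to
# emotional ones, frame by frame but with full sequence context.
import torch
import torch.nn as nn

class BLSTMConverter(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, layers=2):
        super().__init__()
        # bidirectional=True lets every frame see both past and future
        # context, capturing the inter-frame correlation that a frame-wise
        # feed-forward network ignores.
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, x):            # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)         # h: (batch, frames, 2*hidden)
        return self.proj(h)          # converted features, same shape as x

model = BLSTMConverter()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One training step on dummy neutral/emotional feature pairs (in practice,
# parallel recordings aligned frame-to-frame, e.g. by dynamic time warping).
neutral = torch.randn(8, 200, 40)
emotional = torch.randn(8, 200, 40)
optimizer.zero_grad()
loss = loss_fn(model(neutral), emotional)
loss.backward()
optimizer.step()
```

The converted feature sequence would then be handed to the L1/2 reconstruction stage above to synthesize the sad, angry, or happy waveform.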