Speech enhancement technology aims to improve the clarity and quality of speech corrupted by noise. Speech enhancement models based on generative adversarial networks (GANs) suffer from large parameter counts and high hardware requirements, which prevents them from being deployed in practical application scenarios. In addition, GAN-based methods face a mismatch between the training objective and the evaluation metrics, as well as a trade-off between speech distortion and noise suppression. To address these problems, this paper improves the speech enhancement generative adversarial network and proposes an improved variant.

Firstly, this paper analyzes the feature resolution of each layer of the generative model in the speech enhancement GAN, and uses heat maps to examine the attention each layer pays to different regions of the input data. Feature resolution and parameter count are then used as reference indices to compress the depth of the network, which significantly reduces the number of parameters. To alleviate the decline in feature-extraction ability caused by compressing the model depth, a bi-directional long short-term memory (BiLSTM) structure is added to the model. By fusing the features extracted by the convolutional neural network with those of the BiLSTM, the feature-modeling ability of the model is strengthened.

Secondly, this paper designs the discriminative model as a regression model that directly regresses speech quality evaluation indicators and participates in the optimization process, solving the metric mismatch problem in model training. The perceptual evaluation of speech quality function is used as the training constraint of the discriminative model, transforming its optimization objective from discrete binary labels to continuous scores.

In addition, to preserve more detail in the enhanced speech, a multi-scale weighted spectral distance loss function is proposed. The multi-scale spectral error of the data is computed using different analysis parameters, and the spectral losses are fused by weighting, which provides finer-grained spectral constraints for model training.

Finally, a new frame reconstruction method is proposed. Rather than directly averaging the sampling points in the overlap region, this method determines a weighting coefficient for each sample according to its distance from the non-overlapping parts of the preceding and following frames, and obtains the reconstructed value as the weighted sum of the corresponding overlapping samples of the two frames. This alleviates uneven transitions in the overlap region between speech frames and suppresses the signal distortion caused by inconsistent outputs of adjacent frames.

The experimental results show that, compared with the baseline model, the proposed model requires less computation and substantially improves the scores on four evaluation criteria. Compared with other GAN-based speech enhancement models, this model has advantages in both speed and performance.
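The distance-weighted frame reconstruction can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract only states that weights depend on each sample's distance to the non-overlapping regions of the two frames, so the linear weighting scheme and the helper names (`reconstruct_overlap`, `stitch_frames`) are assumptions.

```python
import numpy as np

def reconstruct_overlap(prev_frame, next_frame, overlap):
    """Blend the overlap region of two adjacent enhanced frames.

    Each overlapping sample is a weighted sum of the two frames' values.
    The weight of the earlier frame decreases with the sample's distance
    from that frame's non-overlapping part, and vice versa; the two
    weights sum to 1. (Linear weights are an assumption here.)
    """
    tail = np.asarray(prev_frame)[-overlap:]  # end of the earlier frame
    head = np.asarray(next_frame)[:overlap]   # start of the later frame
    k = np.arange(overlap)
    w_prev = (overlap - k) / (overlap + 1)    # ~1 near the earlier frame
    w_next = (k + 1) / (overlap + 1)          # ~1 near the later frame
    return w_prev * tail + w_next * head

def stitch_frames(frames, hop):
    """Reassemble a signal from equal-length frames taken with hop size
    `hop`, using weighted reconstruction in each overlap region."""
    frame_len = len(frames[0])
    overlap = frame_len - hop
    out = list(frames[0])
    for prev, nxt in zip(frames, frames[1:]):
        out[-overlap:] = reconstruct_overlap(prev, nxt, overlap)
        out.extend(np.asarray(nxt)[overlap:])
    return np.asarray(out)
```

Because the two weights sum to 1, identical overlapping samples pass through unchanged, while differing samples cross-fade smoothly instead of producing the step-like transitions that direct averaging can leave at the frame boundaries.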