With the advancement of technology and the progress of society, the demand for high-quality communication has become increasingly urgent. However, due to limitations in communication network bandwidth and acquisition hardware, public switched telephone networks still employ narrowband speech communication. The narrowband speech spectrum is confined to the low-frequency range (300-3400 Hz) and lacks high-frequency components, resulting in degraded speech quality that is susceptible to interference noise. Speech Bandwidth Extension (SBE) technology aims to restore the high-frequency components of narrowband speech, improve overall quality, and achieve high-quality communication at low bit rates; it can also contribute to the development of related fields. Traditional SBE techniques rely on signal processing and statistical modeling and are constrained by acoustic assumptions, so the quality of the reconstructed wideband speech is suboptimal. In recent years, with the rise of deep learning, the SBE field has gradually evolved, and, driven by large-scale data, reconstructed speech quality has improved over conventional methods. However, many researchers have overlooked the importance of processing input speech with audio codecs; in fact, processing input speech to simulate real communication conditions can significantly enhance model performance and practicality. This study therefore continues to explore deep-learning-based SBE in conjunction with the EVS codec, which meets 5G communication standards, as detailed below.

In Chapter 3, a speech bandwidth extension method based on complex-valued Convolutional Neural Networks (CNNs) is proposed. Unlike previous neural network models that relied on single-domain modeling strategies, this chapter designs a model with complex-valued CNNs as its backbone, dividing the input narrowband speech into real and imaginary parts and building a separate model for each, thus constructing a two-path mapping. The model input is the real and imaginary data obtained after the Short-Time Fourier Transform (STFT), without any further feature extraction, avoiding the insufficient representation caused by relying on a single hand-crafted feature. The method exploits the powerful nonlinear modeling capability of neural networks to learn the mapping between the low-frequency real and imaginary components and their high-frequency counterparts, thereby achieving speech bandwidth extension.

In Chapter 4, an end-to-end neural network model based on temporal feature-weighting enhancement is proposed. CNNs have excellent feature extraction capabilities but are weaker at modeling the context dependencies of continuous temporal signals. To address this, a network with dilated causal convolutions is proposed to compensate for this deficiency. Additionally, a multi-head self-attention mechanism combined with a Gated Recurrent Unit (GRU) is designed in the bottleneck layer; the GRU's sensitivity to context relationships compensates for the relative-position ambiguity of the self-attention mechanism. Compared with Long Short-Term Memory (LSTM) networks, GRUs greatly increase the model's processing speed without sacrificing much performance, helping the algorithm meet real-time requirements.

In Chapter 5, we discuss challenges faced by deep learning models in speech signal processing tasks, including inadequate and imbalanced exploitation of data features, lengthy training processes, and the need for better generated speech quality. To address these issues, we propose a novel SBE neural network model that effectively integrates diverse data characteristics, capturing more low-high-frequency correlations within limited data features and reducing training time. We incorporate a multi-head self-attention mechanism with residual structures to emphasize key feature weights while respecting the causality of the speech signal, optimizing feature utilization. Moreover, we employ a hybrid loss function combining time-frequency attributes and Mel-spectrum properties, encouraging the model to optimize from multiple perspectives and ultimately enhancing generated speech quality. Empirical results validate the strong performance of the proposed model in the SBE domain.
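The two-path input of Chapter 3 can be illustrated with a minimal NumPy sketch: the narrowband signal is framed, windowed, and transformed by an STFT, and the resulting real and imaginary matrices are fed to the two paths without further feature extraction. The frame length, hop size, and window choice here are illustrative assumptions, not parameters taken from the thesis.

```python
import numpy as np

def stft_real_imag(signal, frame_len=512, hop=256):
    """Split a narrowband signal into the real and imaginary STFT
    components that would feed the two paths of a complex-valued CNN.
    frame_len/hop are illustrative choices, not the thesis settings."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)  # (n_frames, frame_len // 2 + 1)
    return spec.real, spec.imag          # inputs to the real/imaginary paths
```

Because no magnitude or phase features are derived, both paths see the raw complex spectrum, which is what lets the model learn real-to-real and imaginary-to-imaginary low-to-high-frequency mappings directly.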
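The dilated causal convolution of Chapter 4 can be sketched as follows: the output at time t depends only on inputs at t, t-d, t-2d, ... (never on future samples), so stacking layers with growing dilation d widens the receptive field over the speech sequence while preserving causality. This is a generic pure-NumPy illustration of the operation, not the thesis implementation.

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation):
    """Causal 1-D convolution with dilation: y[t] depends only on
    x[t], x[t - d], x[t - 2d], ... (zeros before the signal start)."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so output stays causal
    return np.array([sum(kernel[j] * xp[t + pad - j * dilation]
                         for j in range(k))
                     for t in range(len(x))])
```

An impulse input makes the causal, dilated taps visible: with a two-tap kernel and dilation 2, the impulse response appears at offsets 0 and 2, never at negative offsets.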
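The hybrid loss of Chapter 5 can be sketched as a weighted sum of a time-domain term, an STFT-magnitude term, and a Mel-spectrum term. The weights, the L1 distance, the Mel band count, and the sample rate below are illustrative assumptions; the thesis does not specify them in this overview.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft_bins, sr=16000):
    """Standard triangular Mel filterbank (parameters illustrative)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft_bins - 1) * 2 * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft_bins))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)  # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)  # falling slope
    return fb

def hybrid_loss(pred, target, alpha=1.0, beta=1.0, gamma=1.0):
    """Hybrid loss: time-domain L1 + STFT-magnitude L1 + Mel-spectrum L1.
    alpha/beta/gamma are illustrative weights, not the thesis values."""
    t_loss = np.mean(np.abs(pred - target))
    P, T = np.abs(np.fft.rfft(pred)), np.abs(np.fft.rfft(target))
    f_loss = np.mean(np.abs(P - T))
    fb = mel_filterbank(40, len(P))
    m_loss = np.mean(np.abs(fb @ P - fb @ T))
    return alpha * t_loss + beta * f_loss + gamma * m_loss
```

Combining terms at different resolutions is what lets the model be penalized simultaneously for waveform errors, fine spectral errors, and perceptually weighted (Mel-scale) errors.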