Reducing background noise to improve speech quality and enable automatic speech recognition is a long-standing research topic in speech and acoustic processing and related fields. Speech enhancement refers to removing noise from noisy speech and is often used as a front-end preprocessor for other acoustic tasks, while speech recognition refers to the automatic transcription of speech signals into text by computers. Although deep learning-based speech enhancement and recognition have been studied for nearly a decade, maintaining the performance of speech algorithms in complex scenarios remains an open problem. In general, complex scenes pose five main challenges: (1) in low signal-to-noise-ratio scenes, the structural information of speech is submerged in noise, causing speech feature extraction to fail; (2) the noise intensity in complex scenes is dynamic and changeable, which places higher demands on the robustness of the speech enhancement system; (3) under unseen conditions, discriminative speech enhancement algorithms generalize poorly and are prone to failure; (4) limited personal resources cannot meet the resource (data, computing power) requirements of speech-related tasks; and (5) the distortion introduced by speech enhancement seriously degrades the accuracy of speech recognition.

To address challenges (1) and (2), this thesis proposes three speech enhancement algorithms for complex scenarios: information distillation-based IDANet, collaborative learning-based SECL, and iterative learning-based SEIL. Experimental results show that, compared with state-of-the-art speech enhancement algorithms, the three proposed algorithms significantly improve speech quality. To address challenge (3), this thesis explores the potential of generative algorithms for speech enhancement tasks and proposes a noise-aware conditional diffusion model dubbed NA-CDiffuSE. Compared with discriminative models, the proposed algorithm is less susceptible to overfitting and exhibits stronger generalization; compared with existing diffusion model-based speech enhancement methods, NA-CDiffuSE improves voice quality, showing significant advantages. To address challenge (5), this thesis designs a multi-task cascaded model that uses additional guidance from downstream tasks to constrain the training process of speech enhancement. At the same time, this thesis combines transfer learning, contrastive learning, and knowledge distillation to address challenge (4). By cascading the speech enhancement model with a lightweight speech recognition model, the designed algorithm effectively improves speech recognition accuracy in noisy scenarios while maintaining efficiency.

Based on the above research, this thesis designs and implements a speech system for urban traffic scenarios. The system allows users to record or upload speech and integrates a variety of enhancement and recognition algorithms. In addition to demonstrating individual algorithms, the system also supports comparison between them. As a result, the system meets the basic needs of speech enhancement and recognition.