Research And Implementation Of Real-time Voice Activity Detection In Multimedia Live Broadcast Scene

Posted on: 2023-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: X Hao
Full Text: PDF
GTID: 2558306914457084
Subject: Information and Communication Engineering
Abstract/Summary:
The popularity of webcasting has greatly changed the way people work and play, and a range of voice technologies are now applied to livestreaming scenarios to improve the user experience. Voice Activity Detection (VAD) is a signal processing technology that detects whether speech is present in audio. As an important front-end module in the workflow of a voice system, VAD directly affects the performance of downstream modules and has a wide range of application scenarios and practical value. However, applying existing VAD technology to live broadcast scenarios still raises several problems: (1) The robustness of existing algorithms to interference from different devices and to complex environmental noise still needs to be improved. (2) Most existing algorithms focus on improving offline accuracy and cannot meet the real-time requirements of live broadcast scenarios. (3) In real-world scenarios, there is a domain mismatch between training and test data, which significantly degrades the performance of existing algorithms.

To address the robustness and efficiency problems of VAD in live broadcast scenarios, this paper proposes a real-time voice activity detection algorithm based on feature enhancement and time-frequency attention. From the perspective of digital signal processing, spectrogram-based mean subtraction is used to reduce device interference in audio signals. By analyzing what spectrogram-based audio classification has in common with image classification, the paper applies a fast image edge detection algorithm to enhance the sound texture in the spectrogram. This helps the network learn the texture information of complex environmental noise and thereby improves its robustness in complex noisy environments. In addition, based on the differences between the two types of tasks, a simple and effective time-frequency attention module is proposed. The module takes into account the different characteristics of the time-frequency representation along the time dimension and the frequency dimension, which helps the network learn the time-frequency information more fully and improves detection accuracy.

To address the domain mismatch that arises when the algorithm is applied in real-world scenarios, this paper proposes domain adversarial training based on symmetric positive definite representation and transfer weights. Kernel matrix-based representation learning is introduced into domain adversarial training, and second-order information is used to improve the network's ability to model complex feature relationships. On this basis, the domain similarity and prediction certainty of each sample are used as measurement criteria, and a transfer weight is introduced into the loss function to reduce the negative impact of difficult-to-transfer samples on feature space alignment, thereby achieving more effective domain adaptation.

Finally, the proposed algorithm is packaged and integrated with the laboratory's self-developed live broadcast system as a simple demonstration application.
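
The abstract does not say how the spectrogram-based mean subtraction is computed. The following is only a minimal sketch of one common form of the idea, under two assumptions that are not taken from the thesis: the subtraction is done on a log-magnitude spectrogram, and the mean is taken per frequency bin over all frames. Under those assumptions a fixed device frequency response, which multiplies the spectrum, becomes an additive offset in the log domain and is cancelled by the subtraction. The function names and the toy data are hypothetical.

```python
import math

def log_magnitude(spec, eps=1e-10):
    """Convert a magnitude spectrogram (frames x bins) to the log domain."""
    return [[math.log(max(v, eps)) for v in frame] for frame in spec]

def subtract_spectral_mean(log_spec):
    """Subtract, for each frequency bin, its mean over all frames.

    Assumption (not from the thesis): the mean is taken over time, per bin.
    """
    if not log_spec:
        return []
    n_frames = len(log_spec)
    n_bins = len(log_spec[0])
    means = [sum(frame[k] for frame in log_spec) / n_frames for k in range(n_bins)]
    return [[frame[k] - means[k] for k in range(n_bins)] for frame in log_spec]

# Toy data: the same 3-frame, 3-bin signal seen through a hypothetical
# device whose fixed per-bin gain colors the spectrum.
clean = [[1.0, 2.0, 4.0], [2.0, 1.0, 8.0], [4.0, 4.0, 2.0]]
device_gain = [0.5, 3.0, 1.5]
colored = [[v * g for v, g in zip(frame, device_gain)] for frame in clean]

norm_clean = subtract_spectral_mean(log_magnitude(clean))
norm_colored = subtract_spectral_mean(log_magnitude(colored))

# Largest difference between the two normalized spectrograms.
max_diff = max(
    abs(a - b)
    for fa, fb in zip(norm_clean, norm_colored)
    for a, b in zip(fa, fb)
)
```

In this toy case the two normalized spectrograms agree up to floating-point error, because the device gain contributes the same constant to every frame of a given bin. A real-time system cannot average over the whole recording, so it would presumably replace the global mean with a running estimate; that variant is not shown here.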
Keywords/Search Tags: deep learning, voice activity detection, domain adaptation, convolutional neural network