End-to-end Target Speech Extraction Algorithm Research

Posted on: 2022-12-04  Degree: Master  Type: Thesis
Country: China  Candidate: J Y Han  Full Text: PDF
GTID: 2518306749983269  Subject: Electronics and Communications Engineering
Abstract/Summary:
In acoustically complex scenarios, humans can easily focus on the speech they are interested in, but this is particularly difficult for machines. This is known as the cocktail party problem. Target speech extraction (TSE) is one of the key technologies for solving the cocktail party problem. It has become indispensable in fields such as webcasting, teleconferencing, and smart homes, and has attracted extensive attention from academia and industry. With the current development of deep learning, designing a target speech extraction system with excellent performance and robustness has become an important question. In recent years, research on end-to-end target speech extraction has made significant progress; however, many problems remain when these methods are applied to real scenarios. The three main challenges are: 1) how to effectively model the acoustic information of the target speaker in order to extract the target speech more accurately; 2) when a multi-channel microphone array is available, how to effectively extract multi-channel spatial information to improve end-to-end target speech extraction performance; 3) when there is an acoustic mismatch between training and test utterances, how to design a robust target speech extraction algorithm for this difficult condition. This thesis focuses on these three issues. First, a new speaker adaptation algorithm based on an attention mechanism is proposed; it improves target speech extraction performance by dynamically adjusting the acoustic bias information according to the differing acoustic characteristics of the target speaker and the mixed speech. Second, to improve multi-channel end-to-end target speech extraction, a channel decorrelation algorithm is proposed that extracts spatial difference information to enhance the acoustic information of the target speaker. Finally, to improve the robustness of target speech extraction in complex acoustic conditions, a novel time-frequency domain speech separation and extraction architecture, termed DPCCN, is proposed. Building on DPCCN, a Mixture-Remix mechanism is proposed that exploits mixed speech data collected in real conditions to fine-tune the acoustic model and improve the performance of the target speech extraction system. To verify the effectiveness of the proposed methods, all experiments in this thesis are performed on internationally used public speech separation datasets. Specifically, the multi-channel reverberant WSJ0-2mix dataset is used to evaluate the proposed attention and channel decorrelation algorithms, and Libri2Mix and Aishell2Mix are used to evaluate the DPCCN systems. The experimental results show that the three algorithms proposed in this thesis significantly improve target speech extraction performance over state-of-the-art baseline systems, providing an important reference for applying target speech extraction technologies in industry.
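The abstract does not give the formulation of the attention-based speaker adaptation, so the following is only a generic illustration of the underlying idea: a speaker bias that changes per mixture frame instead of being one fixed embedding. It assumes scaled dot-product attention, with mixture frames as queries and enrollment (target-speaker) frames as keys and values. The function name `attention_speaker_bias` and the toy feature vectors are hypothetical and are not taken from the thesis.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_speaker_bias(mixture_frames, enrollment_frames):
    """Return one bias vector per mixture frame.

    Assumption (not from the thesis): each bias is the attention-weighted
    sum of the enrollment frames, with weights from scaled dot products
    between the mixture frame and every enrollment frame.
    """
    dim = len(enrollment_frames[0])
    scale = math.sqrt(dim)
    biases = []
    for query in mixture_frames:
        weights = softmax([dot(query, key) / scale for key in enrollment_frames])
        bias = [
            sum(w * frame[d] for w, frame in zip(weights, enrollment_frames))
            for d in range(dim)
        ]
        biases.append(bias)
    return biases


# Toy 2-dimensional features, purely illustrative.
mixture = [[1.0, 0.0], [0.0, 1.0]]
enrollment = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
biases = attention_speaker_bias(mixture, enrollment)
```

In this sketch each mixture frame pulls its bias toward the enrollment frames it resembles most, which is one plausible reading of "dynamically adjusting the acoustic bias information"; a real system would learn projections for the queries, keys, and values.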
Keywords/Search Tags: Target speech extraction, attention, channel decorrelation, multichannel differential information, DPCCN