Research On Multichannel Speech Enhancement Algorithm Based On Spatial-Temporal Graph Convolutional Network

Posted on: 2023-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: M H Hao
GTID: 2558306845997829
Subject: Electronic Science and Technology
Abstract/Summary:
Multichannel speech enhancement algorithms use a microphone array to capture features of the signal in the time, spatial, and frequency domains in order to separate and estimate the target and non-target signal components, suppressing interference and noise and improving target speech enhancement in unfamiliar and variable scenes. In recent years, data-driven algorithms based on machine learning have been applied effectively to speech enhancement and have achieved significant performance improvements over traditional beamforming methods, especially in dealing with burst noise. However, current multichannel speech enhancement algorithms still leave the following bottlenecks unaddressed:

(1) They simply appropriate existing machine learning networks without using the unique mechanisms of acoustic array signals to modify the network model in an efficient, controllable, and interpretable way, so they are not yet applicable to small intelligent human-computer interaction devices.
(2) Because accurate a priori information about the array layout is lacking, it is difficult to fully exploit the spatial correlation between multichannel signals, which results in poor blind source separation in practical applications.
(3) The complex spatial-temporal correlations of acoustic array signals are not effectively modeled and decorrelated, which exacerbates the confounding of target and non-target signal components.

Therefore, this thesis uses graph theory in non-Euclidean space to model the complex spatial-temporal correlations implied in acoustic array signals, and constructs a graph aggregation operation and a dynamic adjacency matrix based on the channel correlation and time-frequency correlation of speech signals. A spatial-temporal graph convolutional speech enhancement network is built to significantly improve the quality of target speech enhancement in the absence of accurate a priori information about arrays and scenes. Three network optimization strategies are proposed to provide feasible solutions and technical support for implementing the algorithm in small intelligent human-computer interaction devices: a complex-spectrum extension of the network, a module parameter optimization method, and a target reconstruction loss function based on a speech intelligibility index. The details of the study are as follows:

(1) In the absence of array and scene a priori information, the multichannel speech enhancement problem is modeled as a graph, based on non-Euclidean graph theory, to parse the spatial correlations in the acoustic array signal that relate to the array topology and source location. Specifically, this includes graph structure modeling for microphone spatial-temporal data, design of a multichannel spatial-temporal graph aggregation operation, construction of a dynamic adaptive adjacency matrix that can reflect long-term associations between channels, and construction of the graph neural network.
(2) The spatial correlation extraction method based on graph convolution and the time-domain correlation extraction method based on time-frequency convolution are fused to construct a spatial-temporal graph convolutional speech enhancement network. The network extracts the temporal, spatial, and frequency correlation features of acoustic array signals for multi-source signal separation, which significantly improves the quality of target speech reconstruction and suppresses noise and interference in the absence of scene and array a priori information. Experiments demonstrate that the proposed network improves the perceptual evaluation of speech quality score by more than 11% over the current best algorithm in a variety of noisy scenarios, and that it also achieves the best results on the subjective evaluation metrics.
(3) To further address the bottleneck of applying speech enhancement networks to small intelligent human-computer interaction devices, three measures are taken. The spatial-temporal graph convolutional speech enhancement network is extended to the complex domain to address the loss of phase information in target speech reconstruction. The number of network parameters and the real-time performance of the system are optimized. A loss function is designed based on speech intelligibility as perceived by the human ear. Experiments demonstrate that the complex-spectrum network raises the upper limit of target speech enhancement performance, that the parameter-count and real-time optimizations provide technical support for the engineering implementation of the algorithm, and that the intelligibility-based loss function makes the reconstructed target speech output by the network more consistent with human auditory perception.
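The abstract does not give the exact form of the dynamic adjacency matrix or the graph aggregation operation, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes that each microphone channel is a graph node with a real-valued feature vector, that edge weights come from the Pearson correlation between channels followed by a row-wise softmax, and that aggregation is a single adjacency-weighted mix followed by a linear projection and ReLU. The function names `dynamic_adjacency` and `graph_aggregate` are hypothetical.

```python
import numpy as np


def dynamic_adjacency(features):
    """Build a row-normalized adjacency matrix from channel correlation.

    features: array of shape (channels, dim), one feature vector per microphone.
    Assumption: Pearson correlation plus softmax; the thesis may use another rule.
    """
    # Center and L2-normalize each channel so dot products are correlation coefficients.
    centered = features - features.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    unit = centered / np.maximum(norms, 1e-12)
    corr = unit @ unit.T  # values in [-1, 1], shape (channels, channels)
    # Row-wise softmax turns correlations into positive weights that sum to 1.
    scores = np.exp(corr - corr.max(axis=1, keepdims=True))
    return scores / scores.sum(axis=1, keepdims=True)


def graph_aggregate(features, adjacency, weight):
    """One graph convolution step: mix channels, project features, apply ReLU."""
    return np.maximum(adjacency @ features @ weight, 0.0)


# Toy data: 4 microphone channels, 16 input features, 8 output features.
rng = np.random.default_rng(0)
channels, dim, out_dim = 4, 16, 8
x = rng.standard_normal((channels, dim))
w = rng.standard_normal((dim, out_dim)) * 0.1

adj = dynamic_adjacency(x)      # recomputed from the current input, hence "dynamic"
y = graph_aggregate(x, adj, w)  # shape (channels, out_dim)
```

Because the adjacency is recomputed from the input features rather than fixed from a known array geometry, a sketch like this needs no a priori information about microphone positions, which is the property the abstract emphasizes.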
Keywords/Search Tags: Spatial dependency, Acoustic array signal graph aggregation, Spatial and temporal correlation analysis and fusion, Spatial-temporal graph convolution network, Multichannel speech enhancement