
Speaker-Independent Single-Channel Speech Separation Based On Deep Learning

Posted on: 2018-08-15    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y N Wang    Full Text: PDF
GTID: 1318330512482674    Subject: Information and Communication Engineering
Abstract/Summary:
With the development of computers and the Internet in recent years, human lifestyles have changed considerably and communication between humans and computers has grown rapidly. As the most natural and convenient way to exchange information, speech has attracted increasing attention from researchers. Following the great success of human-computer interaction techniques such as speech synthesis and automatic speech recognition (ASR), more challenging problems remain to be solved. For instance, background noise and reverberation not only degrade speech quality and intelligibility but also harm the performance of ASR systems, which restricts the application of speech signal processing technologies. Recovering the clean target speech from a mixed signal has therefore become an important problem, and speech separation is a key topic in this field that has attracted many researchers.

Deep learning has achieved a series of breakthroughs since the beginning of the 21st century. Professor Hinton of the University of Toronto first adopted deep models for image coding and text classification and obtained substantial performance improvements. Subsequently, Dr. Li Deng of Microsoft Research applied the deep neural network (DNN) to speech recognition and greatly improved recognition accuracy. Since then, deep learning has been widely used in speech, image, and video signal processing, and industry leaders such as Google, Microsoft, and Baidu have integrated deep learning methods into their products, further advancing the field and demonstrating the potential of big data for recognition tasks. In this thesis, we focus on achieving satisfactory separation performance under the speaker-independent condition using deep learning methods trained on big data. Among the different approaches to speech
separation, single-channel speech separation is an important task, in contrast to microphone array techniques that additionally exploit spatial information. Among single-channel methods, the speaker-independent approach, which assumes no prior information about the mixed speakers, is the most difficult. Recently, the computational auditory scene analysis (CASA) approach to single-channel speech separation has achieved great success, but it introduces many distortions into the target signal, whereas a regression DNN can preserve the target signal better. This thesis focuses on speaker-independent single-channel speech separation based on deep learning techniques.

Firstly, we propose a speaker-independent DNN for different gender combinations. The theoretical basis for multi-speaker speech separation is that the characteristics of the mixed speakers are discriminative, for instance different formants, different spectral distributions, and different durations for the same phone. In particular, male and female speakers differ considerably in their vocal tracts, which makes separation possible. In this work we adopt the log-power spectra (LPS) of the mixed speech as the DNN input feature and the LPS of the clean speech as the output target; the DNN is then trained to learn the non-linear mapping between mixed and clean speech.

Secondly, we construct a speaker-independent single-channel speech separation system based on speaker combination detection. We first employ a model-based approach to approximate the distances among speakers and cluster them into four well-separated sub-groups. We then propose a gender mixture detector based on a newly designed DNN architecture with four outputs, two representing the female speaker groups and two characterizing the male speaker groups. With this detector we can decide the gender combination of the mixed speech and select the corresponding separator to conduct separation.

Finally, we adopt the maximum likelihood (ML) estimation criterion to improve on the minimum mean squared error (MMSE) objective function used to train the regression DNN in the LPS domain. Under the assumption that the prediction error vector follows a multivariate Gaussian density, we design a training procedure for the ML-trained DNN that alternately updates the DNN parameters and the covariance matrix of the Gaussian density. Furthermore, we show that the MMSE criterion is equivalent to the proposed ML criterion when the covariance matrix of the prediction error distribution is constrained to be the identity matrix, that is, under the strong assumption that all LPS components have equal variance. Since this assumption does not hold in general, it limits the generalization capability of MMSE optimization. Accordingly, we relax this constraint and propose a new objective function for DNN training, obtaining performance improvements for speech separation. The thesis closes with a summary and plans for future work.
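The LPS front end described above can be sketched as follows. This is a minimal illustration assuming a standard windowed-FFT framing; the frame length, hop size, and floor constant `eps` are illustrative choices, not values taken from the thesis.

```python
import numpy as np

def lps_features(signal, frame_len=512, hop=256, eps=1e-12):
    """Log-power spectra (LPS): the framewise log of |FFT|^2, the feature
    used both as DNN input (mixed speech) and as target (clean speech)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    return np.log(np.abs(spectra) ** 2 + eps)

rng = np.random.default_rng(0)
mixed = rng.standard_normal(4096)   # stand-in for a mixed-speech waveform
X = lps_features(mixed)             # DNN input: LPS of the mixed speech
print(X.shape)                      # (15, 257): n_frames x (frame_len // 2 + 1)
```

A regression DNN would then be trained frame by frame to map such mixed-speech LPS vectors to the corresponding clean-speech LPS vectors.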
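The decision step of the gender mixture detector could look like the sketch below. The rule of picking the two most activated of the four group outputs is an assumption made here for illustration, not necessarily the exact decision rule used in the thesis.

```python
import numpy as np

GROUPS = ["F1", "F2", "M1", "M2"]  # two female and two male speaker clusters

def gender_combination(outputs):
    """Map the detector's four group activations to a gender combination
    (F+F, F+M, or M+M) by taking the two most activated groups."""
    top2 = np.argsort(outputs)[-2:]
    genders = sorted(GROUPS[i][0] for i in top2)
    return "+".join(genders)

print(gender_combination(np.array([0.9, 0.1, 0.8, 0.2])))  # F+M
print(gender_combination(np.array([0.7, 0.8, 0.1, 0.2])))  # F+F
```

The predicted combination would then select which gender-specific separator DNN processes the mixture.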
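The relationship between the ML and MMSE criteria can be checked numerically. The sketch below assumes a diagonal covariance for simplicity (the thesis works with a covariance matrix); the closed-form variance update is the standard ML estimate used in the alternating procedure, and with an identity covariance the ML criterion collapses to MMSE up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical prediction errors e = target_LPS - DNN_output, shape (N, D),
# with deliberately unequal per-component variances.
e = rng.standard_normal((1000, 4)) * np.array([0.5, 1.0, 2.0, 4.0])

def ml_loss(e, var):
    """Per-frame negative log-likelihood under a diagonal Gaussian error
    model, constants dropped: 0.5 * sum_d (log var_d + e_d^2 / var_d)."""
    return 0.5 * np.mean(np.sum(np.log(var) + e ** 2 / var, axis=1))

# Alternating step: with the DNN fixed, the ML covariance update is the
# closed-form per-component mean squared error.
var_ml = np.mean(e ** 2, axis=0)

# With an identity covariance, the ML criterion equals the MMSE criterion,
# i.e. all LPS components are assumed to share the same variance.
mmse = 0.5 * np.mean(np.sum(e ** 2, axis=1))
assert np.isclose(ml_loss(e, np.ones(4)), mmse)

# The learned variances lower the loss relative to the identity assumption.
assert ml_loss(e, var_ml) < ml_loss(e, np.ones(4))
```

Relaxing the identity-covariance constraint is exactly what lets the ML-trained DNN weight LPS components according to their actual error variances.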
Keywords/Search Tags:single-channel speech separation, deep learning, deep neural network, speaker-independent, maximum likelihood estimation