
The Research Of Key Techniques Of Speech Separation And Speech Recognition

Posted on: 2019-05-05, Degree: Master, Type: Thesis
Country: China, Candidate: Y L Liu
GTID: 2428330623450921, Subject: Engineering
Abstract/Summary:
As the most important medium of human communication, speech has been widely studied in both academia and industry. Recently, with the rapid development of artificial intelligence, speech interaction technology has attracted broader attention and has been widely deployed in real-world applications. This thesis focuses on speech separation and speech recognition.

In speech separation, the mixture contains various kinds of noise and irrelevant speech, which makes the speech of interest hard to extract. Speech separation aims to extract the speech of interest from the mixture so that a machine can understand it. However, previous speech separation systems cannot represent the features of speech sufficiently, and they cannot learn speech features from the mixture itself.

Speech recognition aims to convert speech sequences into the corresponding text sequences. In many human-machine conversation systems, however, it is unnecessary to transcribe the speech word by word, because the system only needs to know whether keywords of interest occur in the speech. Detecting such keywords is called keyword spotting. Following the progress of deep learning, previous keyword spotting systems have been built on frame-by-frame labelling of speech sequences, which requires a mature large-vocabulary continuous speech recognition (LVCSR) system to produce correct labels. This requirement is demanding and inflexible.

To address these problems in speech separation and keyword spotting, this thesis makes the following two contributions:

1. To address the insufficient representation of speech signals and the inability to learn from the mixture, we propose a deep transductive NMF model (DTNMF), which incorporates a multi-layer structure into non-negative matrix factorization (NMF) and learns a shared dictionary on the source signal of each speaker and the mixture signal to be separated. Because the multi-layer structure extracts non-linear features and so lets DTNMF learn a more precise representation of the source signals, DTNMF significantly improves speech separation performance. Experimental results on the popular LibriSpeech dataset show that DTNMF outperforms representative NMF models at separating mixtures of single-channel speech signals.

2. To remove the need for an LVCSR system for labelling, we propose a keyword spotting model based on Connectionist Temporal Classification (CTC). We develop an end-to-end neural network architecture with a CTC output layer that converts speech sequences into text sequences. We use a bidirectional Long Short-Term Memory (Bi-LSTM) network to capture bidirectional, long-distance context and to mitigate the exploding and vanishing gradient problems to some extent. Based on this model, we reduce the output space of the keyword spotting model, which speeds up convergence and shrinks the search space; the model is therefore less complex and finds an approximately optimal solution faster. Reducing the output space also avoids the curse of dimensionality. Experimental results show that our model achieves better keyword spotting performance than an LVCSR-based keyword spotting model. Moreover, the Bi-LSTM end-to-end architecture improves keyword spotting performance significantly compared with both DNN and RNN architectures.
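To make the CTC output layer mentioned in the second contribution more concrete, the following is a minimal sketch of standard best-path (greedy) CTC decoding followed by a substring check for a keyword. It is not the thesis's model: the label set, the blank symbol `_`, the per-frame scores, and the function names `ctc_greedy_decode` and `contains_keyword` are all hypothetical and chosen only for illustration. In a real system the per-frame scores would come from the network's output layer.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank label


def ctc_greedy_decode(frame_scores, labels):
    """Best-path CTC decoding: take the top label per frame,
    merge consecutive repeats, then drop blanks."""
    best_path = [
        labels[max(range(len(row)), key=row.__getitem__)]
        for row in frame_scores
    ]
    collapsed = [label for label, _ in groupby(best_path)]
    return "".join(label for label in collapsed if label != BLANK)


def contains_keyword(frame_scores, labels, keyword):
    """Report whether the decoded label sequence contains the keyword."""
    return keyword in ctc_greedy_decode(frame_scores, labels)


# Toy example (assumed data): a small output alphabet of blank plus two letters.
labels = [BLANK, "g", "o"]
frame_scores = [
    [0.10, 0.80, 0.10],  # g
    [0.20, 0.70, 0.10],  # g (repeat, merged)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # o
    [0.10, 0.20, 0.70],  # o (repeat, merged)
    [0.60, 0.20, 0.20],  # blank separates two genuine o's
    [0.10, 0.10, 0.80],  # o
]

decoded = ctc_greedy_decode(frame_scores, labels)
found = contains_keyword(frame_scores, labels, "go")
```

Because CTC sums over all frame-level alignments that collapse to the same label sequence, training needs only the target sequence and not a frame-by-frame labelling. The blank label is what allows a genuinely repeated symbol to survive the merge step, as in the last three frames of the toy example.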
Keywords/Search Tags:Speech Recognition, Speech Separation, Keyword Spotting, Neural Networks, Non-negative Matrix Factorization