
Research On Speech Preprocessing Of Speech Recognition For Multi-talker Conversations In Complex Acoustic Environments

Posted on: 2021-03-11    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L Sun    Full Text: PDF
GTID: 1368330602994250    Subject: Information and Communication Engineering
Abstract/Summary:
The artificial intelligence industry is developing rapidly: intelligent algorithms and intelligent hardware keep emerging, and they will profoundly change people's lives in the near future. To enable natural, barrier-free communication between people and intelligent devices, the first task is to advance speech recognition, which lets a machine understand human speech and transform it into accurate text. Through speech recognition, all kinds of human instructions and requests can be conveyed to machines, which can then respond and interact with humans in turn; this is what people imagine an "intelligent robot" should achieve. Thanks to the efforts of generations of researchers, speech technology has made great progress. The most advanced speech recognition systems now achieve very high accuracy in relatively quiet, interference-free conditions, and some studies even claim to surpass human stenographers. However, when intelligent speech recognition technology is deployed, this ideal level is hard to reach because real scenes are complex. On the one hand, a speech signal travels from the source to the receiver as sound waves in a medium such as air, and along the way it encounters many kinds of interference, such as environmental background noise and reverberation, which greatly reduce the quality and intelligibility of the speech; it is therefore a huge challenge for a machine to accurately recognize signals corrupted by so many complex factors. On the other hand, most current studies focus on recognizing a single target, that is, only one speaker is talking at a time; when the number of speakers increases and different voices mix together, recognition accuracy drops sharply. For these two reasons, speech recognition in complex acoustic scenes remains a difficult problem.

Generally speaking, the whole speech recognition pipeline can be divided into two main parts: a front end and a back end. The back end refers to the recognition itself, generally including acoustic modeling, language modeling, and the decoding algorithm, that is, the part that maps the signal directly to recognized text. The front end deals with the interference in the received signal and provides speech that is as clean as possible to the back end; it can also be called the preprocessing stage. Different types of interference call for different preprocessing algorithms, such as noise reduction in high-noise environments, dereverberation in highly reverberant scenes, and speaker diarization (segmentation and clustering) or speech separation in multi-speaker scenes. This dissertation focuses on speech recognition for multi-talker conversations and studies a range of front-end preprocessing algorithms in order to provide a complete processing framework.

Firstly, to address environmental noise, there are two kinds of speech enhancement algorithms: traditional speech enhancement and deep-learning-based speech enhancement. Traditional unsupervised speech enhancement handles stationary noise well but struggles with non-stationary noise. Over the years, many studies have shown that supervised speech enhancement based on deep neural networks (DNNs) achieves large performance gains over traditional methods, especially on non-stationary noise, but it generalizes poorly in complex scenes, causing problems such as speech distortion and reduced intelligibility. In this dissertation, an LSTM was introduced to capture the long-term structure of speech sequences through its strong temporal modeling capability. In addition, the advantages and disadvantages of different speech enhancement objective functions were explored, and a multi-target learning method was introduced to exploit the complementarity between them, which improved both performance and generalization. The approach was verified on the NSF Hearable Challenge data and yielded a clear improvement in perceived listening quality.
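As a rough illustration of the kind of multi-target setup described above, the following Python sketch pairs an LSTM with two output heads, an ideal-ratio-mask (IRM) head and a log-power-spectrum (LPS) head, trained with a weighted joint loss. The layer sizes, target choices, and loss weight are illustrative assumptions, not the dissertation's exact configuration.

```python
# A minimal sketch (not the dissertation's exact model) of an LSTM enhancement
# network trained with two complementary targets: an IRM head and an LPS head.
# All sizes and the loss weight alpha are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTargetLSTMEnhancer(nn.Module):
    def __init__(self, n_freq=257, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=layers, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())  # IRM in [0, 1]
        self.lps_head = nn.Linear(hidden, n_freq)                                # direct LPS regression

    def forward(self, noisy_lps):                  # noisy_lps: (batch, frames, n_freq)
        h, _ = self.lstm(noisy_lps)
        return self.mask_head(h), self.lps_head(h)

def multi_target_loss(mask_pred, lps_pred, irm_ref, clean_lps, alpha=0.5):
    """Weighted sum of the two objectives; alpha is an assumed trade-off weight."""
    mse = nn.functional.mse_loss
    return alpha * mse(mask_pred, irm_ref) + (1 - alpha) * mse(lps_pred, clean_lps)
```

At inference time the mask head can be applied to the noisy magnitude spectrum while the LPS head provides a complementary regression estimate; training them jointly is one simple way to exploit the complementarity between objective functions noted above.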
Secondly, to address speech separation for multi-talker speech, we proposed a speaker-dependent speech separation system that combines deep-learning-based methods with traditional array signal processing. Specifically, we designed a two-stage single-channel separation framework which, given speaker annotation information, can extract the speech of the target speaker with limited training data. We further combined it with an array algorithm so that the target speaker is estimated more accurately using spatial information, which also avoids the permutation problem. We verified the performance of the algorithm on the CHiME-5 challenge data, recorded in real far-field multi-talker conversations.

Thirdly, when no prior speaker information is available, speaker diarization is needed to preprocess multi-talker conversational data. Traditional speaker diarization scenarios are relatively simple, mainly broadcast and telephone data, and existing systems perform poorly in more complex environments. We proposed a speech enhancement model based on progressive multiple targets and an enhancement pre-selection algorithm based on SNR estimation, which together select an appropriate enhancement target for each scenario. The effectiveness of the overall design was verified in the DIHARD challenge.

Finally, for multi-talker speech recognition in complex acoustic environments without any prior knowledge, we proposed a multi-array speech separation algorithm that simultaneously estimates multiple speakers from far-field data. The separated speech reduced the confusion of the speaker diarization system and ultimately improved the performance of multi-talker speech recognition.

At the end of this dissertation, we summarize all of the research work and look ahead to future work.
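To make the combination of neural masks with array processing in the second contribution more concrete, the sketch below shows mask-driven MVDR beamforming, one standard way to turn time-frequency masks from a separation network into a spatial filter; the dissertation's actual array algorithm may differ in its details.

```python
# A hedged sketch of mask-driven MVDR beamforming; the masks are assumed to come
# from a target-speaker separation network, so the beamformer follows that speaker.
import numpy as np

def mvdr_from_masks(stft, speech_mask, noise_mask):
    """stft: (mics, frames, freqs) complex STFT of the array recording.
    speech_mask, noise_mask: (frames, freqs) masks from a separation network."""
    M, T, F = stft.shape
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]                                                        # (M, T)
        phi_s = (speech_mask[:, f] * X) @ X.conj().T / max(speech_mask[:, f].sum(), 1e-6)
        phi_n = (noise_mask[:, f] * X) @ X.conj().T / max(noise_mask[:, f].sum(), 1e-6)
        phi_n += 1e-6 * np.eye(M)                      # diagonal loading for stability
        _, v = np.linalg.eigh(phi_s)                   # steering vector: principal
        d = v[:, -1]                                   # eigenvector of speech covariance
        num = np.linalg.solve(phi_n, d)                # Phi_n^{-1} d
        w = num / (d.conj() @ num)                     # MVDR weights
        out[:, f] = w.conj() @ X                       # beamformed STFT frame
    return out
```

Because the masks are tied to a specific target speaker, the beamformer output follows that speaker, which is consistent with how the framework described above sidesteps the permutation problem.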
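The SNR-based pre-selection idea from the third contribution can likewise be illustrated with a simple decision rule; the estimator, candidate enhancers, and thresholds below are hypothetical placeholders, not values from the dissertation.

```python
# Illustrative sketch of SNR-based pre-selection between enhancement targets.
# snr_estimator, light_enhancer, and aggressive_enhancer are assumed callables.
def select_front_end(wav, snr_estimator, light_enhancer, aggressive_enhancer,
                     low_snr_db=5.0, high_snr_db=15.0):
    snr_db = snr_estimator(wav)
    if snr_db >= high_snr_db:
        return wav                      # clean enough: skip enhancement to avoid distortion
    if snr_db >= low_snr_db:
        return light_enhancer(wav)      # moderate noise: mild enhancement target
    return aggressive_enhancer(wav)     # very noisy: strongest enhancement target
```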
Keywords/Search Tags: speech signal preprocessing, speech recognition of multi-talker conversations, speech enhancement, speech separation, speaker diarization, deep learning, CHiME challenge