A Study Of Deep Learning Based Child Speech Extraction In Realistic Conditions

Posted on:2021-02-10

Degree:Master

Type:Thesis

Country:China

Candidate:X Wang

Full Text:PDF

GTID:2428330602994316

Subject:Information and Communication Engineering

Abstract/Summary:

Speech is one of the most common means of conveying information. Recent years have seen an explosion in the use of child-centered audio recordings, gathered as infants and young children go about their day. The resulting data are of interest both to a wide range of theories (e.g., developmental psychology, cognitive science) and to numerous applications (e.g., the diagnosis of potential language disorders, the measurement of the effects of an intervention). Despite this interest, very few analysis algorithms can cope with these data. The main difficulties are as follows. First, much of the recorded audio belongs to the infant or child wearing the device, who produces non-speech vocalizations such as crying as well as non-emotional, non-speech productions. In addition, because of the particular nature of recordings of children, adults are almost always speaking while child speech is being recorded. These other speakers may vary in their distance from the microphone, so that their voices alternate between near-field and far-field within the same recording. Finally, the recording device may capture the mixed voices of multiple children as well as multiple adults. To make use of children's speech data in other applications, child speech must be separated out as cleanly as possible. Child speech extraction in real scenes is therefore of great practical significance.

In recent years, deep learning methods have achieved good results in extracting adult speech signals, which also offers a feasible route to the separation and extraction of child speech. However, deep-learning-based adult speech separation algorithms are typically developed and evaluated in simulated environments. Such environments are relatively simple and do not account for the complexity of realistic acoustic environments, such as noise and reverberation occurring together with the speech, and overlapping speech from multiple speakers. Under these complex adverse conditions, the performance of speech separation algorithms degrades, sometimes to the point of being unusable. It is therefore necessary to develop an algorithm for child speech extraction in realistic conditions.

This study focuses on the extraction of child speech in realistic conditions and investigates how to extract child speech as accurately as possible in the presence of noise, reverberation, and multiple interfering sources. Because the task is carried out in real scenes, some objective indicators, such as speech quality and speech intelligibility, cannot be measured, so we also need to propose a set of indicators applicable to real scenes to measure the quality of child speech extraction. Finally, we propose an adaptive method for child speech separation based on different datasets, so as to further improve the accuracy of child speech extraction.

First, we propose a child speech separation model based on progressive learning. To verify that child speech can be separated from adult speech, we first used identity vectors (i-vectors) and multi-dimensional scaling to confirm the difference between child speech and adult speech. We then adopted progressive learning to construct a progressive long short-term memory (LSTM) neural network for child speech separation. In tests on the simulated set, the progressive LSTM network achieved better speech intelligibility and speech quality than the baseline LSTM network on the child speech separation task. Moreover, when tested on the realistic dataset, our model obtained better speech quality than the baseline model.

Second, to make the whole system perform well in real scenes, we propose a child speech extraction framework that uses joint speech enhancement and speech separation. We first made a series of improvements to the speech separation model. For the training data, we expanded the child speech training corpus, adding a large amount of child speech recorded in real scenes to ensure the diversity and richness of child speech, and a large amount of adult speech to ensure complete phoneme coverage. For the model, we defined progressive ideal ratio masks and incorporated them into the original model to obtain the progressive multi-target network. We then added a speech enhancement model as a front end to suppress noise before the speech separation model. In addition, to evaluate the extraction results in realistic conditions, we propose several objective measures, namely the Jaccard error rate and the child speech duration error rate.

Finally, in view of the differing characteristics of different child speech datasets, we propose an adaptive method for child speech separation tailored to specific datasets, built on a two-pass separation strategy. In the first pass, we use the proposed progressive multi-target network to separate the child speech of a specific dataset, treating the output of the separation model as highly reliable. We then use the separated child speech and the corresponding adult speech to construct a training set and fine-tune the model, updating only the parameters of the fully connected layer and adding a regularization term. This yields a separation model for the specific dataset, which is applied to the original input again to obtain the second-pass separation results. Test results on different test sets show that the fine-tuned adaptive model yields better performance on the child speech extraction task.

At the end of the thesis, we summarize the work and discuss future directions for the child speech extraction task.
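
The abstract names two objective measures for realistic recordings, the Jaccard error rate and the child speech duration error rate, without giving their formulas. The following is a minimal sketch of how such measures could be computed, not the thesis's own definitions. It assumes frame-level binary labels (1 = child speech), takes the Jaccard error rate as one minus the intersection-over-union of reference and extracted child-speech frames, and takes the duration error rate as the absolute difference in child-speech duration relative to the reference. The function names are hypothetical.

```python
from typing import Sequence


def jaccard_error_rate(ref: Sequence[int], hyp: Sequence[int]) -> float:
    """Frame-level Jaccard error rate for the child-speech class.

    Assumed definition (not taken from the thesis):
    1 - |ref AND hyp| / |ref OR hyp|, with 1 marking a child-speech frame.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    intersection = sum(1 for r, h in zip(ref, hyp) if r and h)
    union = sum(1 for r, h in zip(ref, hyp) if r or h)
    if union == 0:
        # Neither sequence contains child speech, so there is no error.
        return 0.0
    return 1.0 - intersection / union


def duration_error_rate(ref: Sequence[int], hyp: Sequence[int]) -> float:
    """Relative error in total child-speech duration.

    Assumed definition (not taken from the thesis):
    |hyp duration - ref duration| / ref duration, counted in frames.
    """
    ref_frames = sum(1 for r in ref if r)
    hyp_frames = sum(1 for h in hyp if h)
    if ref_frames == 0:
        raise ValueError("reference contains no child speech")
    return abs(hyp_frames - ref_frames) / ref_frames


# Toy example: the extracted segment has the right length
# but is shifted by one frame relative to the reference.
reference = [0, 1, 1, 1, 1, 0, 0, 0]
extracted = [0, 0, 1, 1, 1, 1, 0, 0]
jer = jaccard_error_rate(reference, extracted)
der = duration_error_rate(reference, extracted)
print(f"Jaccard error rate: {jer:.2f}, duration error rate: {der:.2f}")
```

In this toy example the duration error rate is 0 while the Jaccard error rate is 0.4, which suggests why a duration-based measure and an overlap-based measure might be reported together: the first ignores where the child speech is placed in time, the second does not.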

Keywords/Search Tags:

Child Speech Extraction, Speech Separation, Realistic Condition, Objective Measure, Speech Enhancement, Progressive Learning

Related items

1	Research On Speech Signal Preprocessing Based On Deep Learning In Complex Environment
2	Speech Enhancement Approaches Under Complex Conditions
3	Speech Enhancement Based On Sparse Representation And Dictionary Learning
4	Study On The Speech Enhancement Method Of The Multiple Speech Signals Separation
5	The Study Of Speech Enhancement Technology For Farfield Speech Recognition System
6	Study On Speech Enhancement And Separation
7	Single Channel Speech Enhancement And Separation
8	Research On Speech Separation And Recognition Based On Deep Learning
9	Study On Speech Separation And Speech Enhancement Methods
10	The Research Of In-Car Speech Enhancement Algorithm Based On Blind Source Separation