Research On Multi-person Speech Recognition Based On Deep Learning

Posted on: 2022-11-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2518306788955939
Subject: Automation Technology
Abstract/Summary:
With the development of science and society, people place increasingly demanding requirements on speech recognition technology. Single-speaker speech recognition has reached a high level, but the recognition of multi-speaker speech signals remains unsatisfactory, mainly because it is difficult to determine the identity of the speaker, i.e., which person uttered a given piece of speech. To address this problem, this paper combines speech separation and speaker recognition to propose a recognition technique for mixed multi-speaker speech signals. The technique is aimed at identifying the speakers in a multi-person speech signal and is not concerned with recognizing the speech content. The research consists of two parts: speech separation and speaker recognition.

1. Speech separation: most commonly used speech separation models at this stage are based on recurrent neural networks, which cannot effectively exploit the spatial feature information of the speech signal. This paper proposes a CNN-GRU-Attention model based on a Convolutional Neural Network (CNN), a Gated Recurrent Unit (GRU), and an attention mechanism. The model takes the amplitude spectrum as input, extracts its spatial features with the CNN, and models the temporal information with the GRU. To counter the loss of information in long sequences, the attention module Attention Cell Wrapper is introduced so that the network can weigh the importance of each part of the sequence, improving the separation result. Comparison experiments verify the model's superiority over traditional speech separation models: the global normalized signal-to-distortion ratio (GNSDR) reaches 7.8 dB and the global signal-to-interference ratio (GSIR) reaches 13.8 dB.

2. Speaker recognition: a speaker recognition model based on a residual neural network, gated recurrent units, and an attention mechanism is established. The speech signal is pre-emphasized and feature parameters are extracted, then fed to the residual network to extract feature information. Because the convolution process produces a large number of channels containing redundant information such as noise and silent segments, the attention module SENet is introduced, directing more attention to the channels that carry important information and thereby improving recognition. The temporal information is then processed by a GRU network. Since the commonly used cross-entropy loss function performs only moderately on similar samples, the triplet loss function is chosen to train the network. Finally, comparison experiments show that the proposed speaker recognition model achieves an equal error rate of 4% and a recognition accuracy of 91.5%, outperforming the traditional Gaussian mixture model and the DNN-based i-vector method.
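The GNSDR and GSIR figures above are aggregate separation-quality metrics built on the signal-to-distortion ratio. A rough sketch of the underlying per-utterance SDR, using the common projection-based definition (this is an illustrative assumption, not code from the thesis):

```python
import numpy as np

def sdr(estimate, reference):
    """Signal-to-distortion ratio in dB: the energy of the part of the
    estimate that projects onto the reference, vs. the residual energy."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Orthogonal projection of the estimate onto the reference signal.
    target = (estimate @ reference) / (reference @ reference) * reference
    distortion = estimate - target
    return 10 * np.log10((target @ target) / (distortion @ distortion))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)          # stand-in for a clean source
noisy = clean + 0.1 * rng.standard_normal(16000)  # 10% additive noise
score = sdr(noisy, clean)                   # close to 20 dB for this noise level
```

A normalized SDR (NSDR) would subtract the SDR of the unprocessed mixture, and GNSDR averages those gains over the test set, weighted by utterance length.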
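The SENet channel attention applied to the residual network's feature maps can be sketched as follows. This is a hypothetical NumPy toy with random bottleneck weights `w1`/`w2`; in the thesis these weights are learned inside the network:

```python
import numpy as np

def se_block(features, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map: global-average-pool
    each channel, pass through a small bottleneck, and rescale the channels
    by the resulting sigmoid weights in (0, 1)."""
    z = features.mean(axis=(1, 2))            # squeeze: one scalar per channel
    h = np.maximum(0.0, w1 @ z)               # bottleneck + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))       # excitation: channel weights
    return features * s[:, None, None], s     # reweight each channel

rng = np.random.default_rng(1)
C, r = 8, 2                                   # channels, reduction ratio
x = rng.standard_normal((C, 5, 5))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y, weights = se_block(x, w1, w2)
```

Channels carrying noise or silence end up with small weights, which is the mechanism the abstract relies on to suppress redundant channels.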
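The switch from cross-entropy to triplet loss can be illustrated with a minimal sketch: the anchor embedding should sit closer to a same-speaker embedding (positive) than to a different-speaker embedding (negative) by at least a margin. The margin value 0.2 is illustrative, not taken from the thesis:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared Euclidean distances between speaker embeddings:
    zero when the negative is at least `margin` farther than the positive."""
    d_ap = np.sum((anchor - positive) ** 2)   # anchor-to-positive distance
    d_an = np.sum((anchor - negative) ** 2)   # anchor-to-negative distance
    return max(0.0, d_ap - d_an + margin)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])    # same speaker: close to the anchor
n = np.array([-1.0, 0.5])   # different speaker: far from the anchor
ok_loss = triplet_loss(a, p, n)    # well separated -> loss is 0
bad_loss = triplet_loss(a, n, p)   # roles swapped -> loss is positive
```

Unlike cross-entropy over speaker classes, this objective directly shapes the embedding space, which is why it handles similar-sounding speakers better.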
Keywords/Search Tags:multi-person speech recognition, speech separation, speaker recognition, convolutional neural network, attention mechanism