Research On Multi-person Speech Recognition Based On Deep Learning

Posted on: 2022-11-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2518306788955939
Subject: Automation Technology
Abstract/Summary:
With the development of science and society, people place increasingly demanding requirements on speech recognition technology. Single-speaker speech recognition has reached a high level, but the recognition of multi-speaker speech signals remains unsatisfactory, mainly because it is difficult to determine the identity of the speaker, i.e., which person uttered a given piece of speech. To address this problem, this paper combines speech separation and speaker recognition to propose a recognition technique for mixed multi-speaker speech signals. The technique is aimed at identifying the speakers in a multi-person speech signal and is not concerned with recognizing the speech content. The research consists of two parts: speech separation and speaker recognition.

1. Speech separation: most commonly used speech separation models at this stage are based on recurrent neural networks, which cannot effectively exploit the spatial feature information of the speech signal. This paper proposes a CNN-GRU-Attention model based on a Convolutional Neural Network (CNN), a Gated Recurrent Unit (GRU), and an attention mechanism. The model takes the amplitude spectrum as input, extracts its spatial features with the CNN, and models the temporal information with the GRU. To counter the loss of information in long sequences, the attention module Attention Cell Wrapper is introduced so that the network can weigh the importance of each part of the sequence, improving the separation result. Comparison experiments verify the model's superiority over traditional speech separation models: the global normalized signal-to-distortion ratio (GNSDR) reaches 7.8 dB and the global signal-to-interference ratio (GSIR) reaches 13.8 dB.

2. Speaker recognition: a speaker recognition model based on a residual neural network, gated recurrent units, and an attention mechanism is established. The speech signal is pre-emphasized and feature parameters are extracted, then fed to the residual network to extract feature information. Because the convolution process produces a large number of channels containing redundant information such as noise and silent segments, the attention module SENet is introduced, directing more attention to the channels that carry important information and thereby improving recognition. The temporal information is then processed by a GRU network. Since the commonly used cross-entropy loss function performs only moderately on similar samples, the triplet loss function is chosen to train the network. Finally, comparison experiments show that the proposed speaker recognition model achieves an equal error rate of 4% and a recognition accuracy of 91.5%, outperforming the traditional Gaussian mixture model and the DNN-based i-vector method.
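The GNSDR and GSIR figures above are aggregate separation-quality metrics built on the signal-to-distortion ratio. A rough sketch of the underlying per-utterance SDR, using the common projection-based definition (this is an illustrative assumption, not code from the thesis):

```python
import numpy as np

def sdr(estimate, reference):
    """Signal-to-distortion ratio in dB: the energy of the part of the
    estimate that projects onto the reference, vs. the residual energy."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Orthogonal projection of the estimate onto the reference signal.
    target = (estimate @ reference) / (reference @ reference) * reference
    distortion = estimate - target
    return 10 * np.log10((target @ target) / (distortion @ distortion))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)          # stand-in for a clean source
noisy = clean + 0.1 * rng.standard_normal(16000)  # 10% additive noise
score = sdr(noisy, clean)                   # close to 20 dB for this noise level
```

A normalized SDR (NSDR) would subtract the SDR of the unprocessed mixture, and GNSDR averages those gains over the test set, weighted by utterance length.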
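The SENet channel attention applied to the residual network's feature maps can be sketched as follows. This is a hypothetical NumPy toy with random bottleneck weights `w1`/`w2`; in the thesis these weights are learned inside the network:

```python
import numpy as np

def se_block(features, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map: global-average-pool
    each channel, pass through a small bottleneck, and rescale the channels
    by the resulting sigmoid weights in (0, 1)."""
    z = features.mean(axis=(1, 2))            # squeeze: one scalar per channel
    h = np.maximum(0.0, w1 @ z)               # bottleneck + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))       # excitation: channel weights
    return features * s[:, None, None], s     # reweight each channel

rng = np.random.default_rng(1)
C, r = 8, 2                                   # channels, reduction ratio
x = rng.standard_normal((C, 5, 5))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y, weights = se_block(x, w1, w2)
```

Channels carrying noise or silence end up with small weights, which is the mechanism the abstract relies on to suppress redundant channels.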
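The switch from cross-entropy to triplet loss can be illustrated with a minimal sketch: the anchor embedding should sit closer to a same-speaker embedding (positive) than to a different-speaker embedding (negative) by at least a margin. The margin value 0.2 is illustrative, not taken from the thesis:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared Euclidean distances between speaker embeddings:
    zero when the negative is at least `margin` farther than the positive."""
    d_ap = np.sum((anchor - positive) ** 2)   # anchor-to-positive distance
    d_an = np.sum((anchor - negative) ** 2)   # anchor-to-negative distance
    return max(0.0, d_ap - d_an + margin)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])    # same speaker: close to the anchor
n = np.array([-1.0, 0.5])   # different speaker: far from the anchor
ok_loss = triplet_loss(a, p, n)    # well separated -> loss is 0
bad_loss = triplet_loss(a, n, p)   # roles swapped -> loss is positive
```

Unlike cross-entropy over speaker classes, this objective directly shapes the embedding space, which is why it handles similar-sounding speakers better.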
Keywords/Search Tags:multi-person speech recognition, speech separation, speaker recognition, convolutional neural network, attention mechanism