
Research On Voiceprint Recognition Model Based On End-to-end Neural Network

Posted on: 2021-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Dong
Full Text: PDF
GTID: 2428330629951052
Subject: Communication and Information System
Abstract/Summary:
In the 21st century, with the continuous development of modern information technology, identity authentication based on biometric characteristics has steadily improved and matured. Over more than 50 years of development, voiceprint recognition has gradually been commercialized because of its distinctive advantages, such as long-distance and multi-device data collection. However, large-scale speech datasets collected from the Internet pose various problems, such as multiple channels, diverse background noise, and short audio duration, while traditional voiceprint recognition methods not only involve tedious steps but also suffer a significant drop in performance on large-scale data. To address these problems, this thesis studies end-to-end voiceprint recognition models based on neural networks, which map utterances into a high-dimensional embedding space so that the similarity between speakers can be compared through the distance between embeddings.

Firstly, this thesis selects FBank as the acoustic feature of the end-to-end model and proposes a backbone network based on frequency-domain convolution, called Res-FD-CNN. This backbone repeatedly stacks residual-network building blocks and independent convolutional layers to extract high-dimensional frame-level features. A frequency convolution layer is added as the last convolution layer to focus on learning frequency-domain information, and a temporal pooling layer is then used to extract the deep speaker embeddings. The experiments show that the Res-FD-CNN backbone achieves strong results while requiring less computation than the standard ResNet architecture.

Secondly, this thesis combines Res-FD-CNN with the triplet loss function to form a voiceprint recognition model based on the Euclidean distance between features. The model is pre-trained with the Softmax loss function, which preliminarily forms classification boundaries in the high-dimensional embedding space and reduces the training difficulty of the triplet loss. Two triplet mining strategies are compared in the experiments; training on only the hardest triplets, starting from the pre-trained model, works better than training on all hard triplets.

Finally, this thesis constructs an end-to-end voiceprint recognition model based on a classification network, which uses an improved angle-based A-Softmax as the loss function. The model is trained by stitching together different short-duration utterances from the same speaker, so as to maintain an angular margin between the features of different classes in the embedding space. The experiments verify that the voiceprint model based on the A-Softmax loss outperforms the model based on the triplet loss, and that a loss function based on an angular margin is better suited to learning class-discriminative deep speaker embeddings and a voiceprint recognition model with strong generalization ability from large-scale, multi-class speech training datasets.
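The hardest-triplet strategy mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's exact mining procedure or margin, so the code below is only an assumption: it uses the commonly described "batch-hardest" variant, in which each anchor is paired with its farthest same-speaker embedding and its closest different-speaker embedding under Euclidean distance. The function names and the `margin` value are hypothetical.

```python
import numpy as np


def pairwise_euclidean(emb):
    """Return the (n, n) matrix of Euclidean distances between rows of emb."""
    sq = np.sum(emb ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(d2, 0.0))


def batch_hardest_triplet_loss(emb, labels, margin=0.2):
    """Mean hinge loss over the hardest triplet of each anchor in the batch.

    emb:    (n, d) array of speaker embeddings
    labels: length-n sequence of speaker ids
    margin: assumed hyperparameter, not taken from the thesis
    """
    labels = np.asarray(labels)
    dist = pairwise_euclidean(np.asarray(emb, dtype=float))

    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)  # same speaker, not self
    neg_mask = ~same                                     # different speaker

    # Hardest positive: farthest embedding of the same speaker.
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    # Hardest negative: closest embedding of a different speaker.
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)

    # Anchors lacking a positive or a negative are excluded from the mean.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    losses = np.maximum(hardest_pos - hardest_neg + margin, 0.0)
    return float(losses[valid].mean()) if valid.any() else 0.0
```

When the speakers in a batch are already well separated, every hinge term is zero and the loss vanishes; when speakers are interleaved in the embedding space, each anchor contributes a positive term that pushes same-speaker embeddings together and different-speaker embeddings apart.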
Keywords/Search Tags:Voiceprint recognition, Deep speaker embedding, A-Softmax loss function, Triplet loss function, End-to-end model