
Research On Improvement Of Speaker Recognition Method Based On RSCNN

Posted on: 2020-10-09
Degree: Master
Type: Thesis
Country: China
Candidate: C Dai
Full Text: PDF
GTID: 2428330572482437
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Speaker recognition is a form of biometrics that distinguishes speakers by extracting identity-bearing features from raw speech. Compared with other biometric modalities, it offers convenient data collection and high user acceptability, and it is therefore widely used. With the advent of the Internet and the big-data era, deep learning has come to dominate the speaker recognition field thanks to its representation ability, which surpasses that of traditional shallow models. This thesis focuses on a convolutional neural network model that takes raw speech as input (RSCNN) and on its application to speaker recognition. The RSCNN model learns suitable speaker features directly from speech data, and it depends less on specific prior knowledge than deep learning architectures that take spectral features as input.

Building on prior work on RSCNN, this thesis proposes several improvements. First, to address the relatively high computational cost of model fusion, we propose a feature fusion method that fuses two kinds of features extracted by two convolution kernels of different widths in the first convolutional layer of the RSCNN. We then compare the feature fusion method with the model fusion method in terms of accuracy and training time through several contrast experiments. Results on three public datasets indicate that the proposed feature fusion method differs little from model fusion in accuracy but markedly shortens training time, demonstrating its effectiveness. Second, we extend the aforementioned two-scale feature fusion method with two additional scales, using four different kernel widths to extract four kinds of features at different scales in parallel and fusing them. The experimental results indicate that, within certain limits, recognition performance improves as the number of feature scales increases.

Finally, we design model transfer experiments in which the models trained on the three public datasets are transferred to a self-built dataset and fine-tuned. The results show that the transferred RSCNN model extracts speaker features with a certain degree of invariance on new datasets; that the trained feature-extraction module generalizes better to new datasets as the example diversity of the original dataset increases; and that the proposed feature fusion method further improves the generalization performance of the transferred model.
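The parallel multi-scale feature fusion described above can be sketched as follows. This is a minimal NumPy illustration with random, untrained kernels rather than the thesis's actual implementation; the kernel widths, stride, and function names are assumptions chosen only to show the extract-in-parallel-then-concatenate idea.

```python
import numpy as np

def conv1d(signal, kernel, stride):
    """Valid-mode strided 1-D convolution: a minimal stand-in for one
    learned convolutional filter applied to a raw waveform."""
    n = (len(signal) - len(kernel)) // stride + 1
    return np.array([signal[i * stride : i * stride + len(kernel)] @ kernel
                     for i in range(n)])

def multi_scale_fusion(waveform, kernel_widths=(11, 51, 101, 201),
                       stride=10, seed=0):
    """Apply one kernel per width to the same raw waveform in parallel,
    pass each output through a ReLU nonlinearity, trim the feature
    sequences to a common length, and fuse them by stacking."""
    rng = np.random.default_rng(seed)
    feats = [np.maximum(conv1d(waveform, rng.standard_normal(k), stride), 0.0)
             for k in kernel_widths]
    min_len = min(len(f) for f in feats)          # wider kernels yield fewer frames
    return np.stack([f[:min_len] for f in feats])  # shape: (num_scales, frames)
```

Narrow kernels resolve fine temporal detail while wide kernels capture coarser spectral-envelope-like structure, so fusing the four feature streams gives the downstream layers several time scales at once without training four separate models.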
Keywords/Search Tags:Speaker Recognition, Deep Learning, Feature Fusion