
Research On Machine Learning Based Speaker Recognition

Posted on: 2022-02-04    Degree: Master    Type: Thesis
Country: China    Candidate: J Y Mo    Full Text: PDF
GTID: 2518306494950859    Subject: Electrical engineering
Abstract/Summary:
Biometric recognition technology has been widely adopted in modern society because of its convenience and security. As an important biometric trait, the human voice carries abundant information, and with the widespread use of smart devices, speaker voice data can be collected at very low cost. The analysis of speech is therefore of high practical value. This work studies both speaker recognition and speech emotion recognition with deep learning methods. Speaker recognition is divided into speaker identification and speaker verification, while speech emotion recognition is treated directly as a multi-class classification task.

To take advantage of different attention mechanisms, a dual-path attention mechanism is proposed in this paper, combining self-attention with the convolutional block attention module (CBAM). With the proposed method, performance is significantly improved at a negligible extra time cost.

Based on the Cluster-Range Loss (CRL), an improved version of the Triplet Loss, a Weighted Cluster-Range Loss (WCRL) is presented in this work to improve the performance of CRL on the speaker identification task. The WCRL places more emphasis on increasing inter-class difference, leading to higher classification accuracy on critical samples. To address the low efficiency of CRL in the initial training stage, a novel Criticality-Enhancement Loss (CEL) is also proposed. The CEL attends to the samples that are most easily and most necessarily optimized. Combined with CRL, both the hardest and the easiest samples are considered concurrently at each step, so the training process is greatly accelerated and relatively more time is left for CRL, yielding better performance.

For the speaker identification task, a Top-1 accuracy of 92.0% on the VoxCeleb1 dataset and 84.3% on the CN-Celeb dataset was reached. For the speaker verification task, an equal error rate (EER) of 5.1% was achieved when training on the VoxCeleb1 dataset, which was further reduced to 3.52% when training on the VoxCeleb2 dataset. Compared with the baseline methods, the approaches proposed in this work show clear superiority.

For speech emotion recognition, a light-weight architecture combining ResNet and GRU is proposed in this paper. Compared with methods from other researchers, competitive performance on the IEMOCAP dataset was reached with fewer parameters and features: an unweighted accuracy (UA) of 67.9% and an F1-score of 0.675 were achieved, while the parameter count was reduced by a relative 16.2%.
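The abstract does not give the exact formulation of the dual-path attention mechanism. As a rough illustration of the general idea, the sketch below (assuming PyTorch, parallel paths over the same feature map, and element-wise summation as the fusion rule, none of which the abstract confirms) runs a self-attention path and a CBAM-style channel-and-spatial path in parallel and sums their outputs:

```python
# Minimal sketch of a dual-path attention block: a self-attention path and a
# CBAM-style path run in parallel on the same feature map; their outputs are
# summed. The fusion rule and all layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CBAMPath(nn.Module):
    """Channel attention followed by spatial attention (CBAM-style)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))                    # (B, C) pooled statistics
        mx = x.amax(dim=(2, 3))
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca.view(b, c, 1, 1)                 # channel re-weighting
        sa_in = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)
        sa = torch.sigmoid(self.spatial_conv(sa_in))
        return x * sa                               # spatial re-weighting

class DualPathAttention(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.cbam = CBAMPath(channels)
        self.self_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)          # (B, H*W, C) token sequence
        sa_out, _ = self.self_attn(seq, seq, seq)
        sa_out = sa_out.transpose(1, 2).view(b, c, h, w)
        return sa_out + self.cbam(x)                # parallel paths, summed

# e.g. y = DualPathAttention(32)(torch.randn(2, 32, 40, 100))
```

CBAM itself adds only a small MLP and one 7x7 convolution, which is consistent with the abstract's claim of a negligible extra time cost for the second path.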
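The formulas for CRL, WCRL, and CEL are likewise not given in the abstract. The following sketch is only a stand-in for the stated idea: a hard-sample (CRL-like) term and an easy-sample (CEL-like) term computed concurrently per batch, with the inter-class distance up-weighted as WCRL is said to do. The mining scheme, the `w_inter` and `w_easy` weights, and the hinge form are all assumptions, not the thesis losses.

```python
# Hedged sketch: combine a batch-hard triplet term (hardest samples, CRL-like)
# with a batch-easy term (easiest samples, CEL-like). Assumes PK-sampled
# batches, i.e. every anchor has at least one positive and one negative.
import torch
import torch.nn.functional as F

def dual_criticality_loss(embeddings, labels, margin=0.3, w_inter=2.0, w_easy=0.5):
    dist = torch.cdist(embeddings, embeddings)         # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=dist.device)

    pos = dist.masked_fill(~same | eye, float('-inf'))  # keep positives only
    neg = dist.masked_fill(same, float('inf'))          # keep negatives only

    # Hardest triplet per anchor; w_inter > 1 stresses inter-class separation.
    hard_term = F.relu(pos.amax(dim=1) - w_inter * neg.amin(dim=1) + margin)

    # Easiest triplet per anchor: closest positive vs. farthest negative.
    easy_pos = pos.masked_fill(pos == float('-inf'), float('inf')).amin(dim=1)
    easy_neg = neg.masked_fill(neg == float('inf'), float('-inf')).amax(dim=1)
    easy_term = F.relu(easy_pos - easy_neg + margin)

    return (hard_term + w_easy * easy_term).mean()
```

The intended intuition matches the abstract: early in training the easy term provides many non-zero, cheaply optimized gradients, while the hard term keeps pushing the critical samples apart.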
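For the emotion branch, the abstract names only the building blocks (ResNet and GRU). A minimal sketch of that kind of pipeline is shown below; the layer counts, channel widths, four-class output, and log-mel input are illustrative assumptions, not the thesis configuration.

```python
# Illustrative light-weight ResNet + GRU emotion classifier: a small residual
# CNN extracts local time-frequency patterns from a log-mel spectrogram, a GRU
# models temporal dynamics, and a linear head outputs class logits.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))    # identity shortcut

class ResNetGRU(nn.Module):
    def __init__(self, n_mels=64, n_classes=4, ch=32, hidden=128):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, 3, stride=2, padding=1)  # halves H and W
        self.blocks = nn.Sequential(ResBlock(ch), ResBlock(ch))
        self.gru = nn.GRU(ch * (n_mels // 2), hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                       # x: (B, 1, n_mels, T), n_mels even
        f = self.blocks(self.stem(x))           # (B, ch, n_mels/2, ~T/2)
        b, c, m, t = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(b, t, c * m)  # one vector per frame
        _, h = self.gru(seq)                    # final hidden state (1, B, hidden)
        return self.head(h[-1])                 # emotion logits
```

Keeping the CNN shallow and delegating long-range temporal modeling to a single GRU layer is one plausible way to reach the small parameter budget the abstract reports.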
Keywords/Search Tags: Speaker Recognition, Emotion Recognition, Attention, Loss