| As the most important and indispensable means of communication in human life and work, the speech signal has received considerable attention in both academic research and applications. Each person’s voice features are unique; in theory, voice features, or voiceprints, are like human fingerprints: almost no two people have the same voiceprint. A person’s identity can therefore be determined by recognizing his or her voiceprint, a technique called voiceprint recognition or speaker recognition. Like face, fingerprint, and iris recognition, voiceprint recognition is a biometric technology. Over the past half century it has been the subject of extensive research and development. Most current voiceprint recognition technologies focus on human speech, and traditional voiceprint recognition pipelines are cumbersome and difficult to optimize as a whole. In this paper, we extend the application scenarios of voiceprint recognition and take rhesus monkey vocalizations as the research object. We study an end-to-end neural-network-based voiceprint recognition model that maps a speaker’s utterances into a high-dimensional embedding space and determines similarity by comparing the distances between speaker embeddings. In view of the vocal characteristics of macaque monkeys, we take commonly used hand-crafted input features such as MFCC and LPCC as a reference and introduce interpretable convolutional filters as the input module of our end-to-end model. The raw waveform of the macaque audio is taken as the input. After processing by the convolutional filters, the result is fed into the Deep Speaker model, which serves as the backbone network. The Deep Speaker network extracts frame-level features from utterances with a feedforward deep neural network. A pooling layer and a length normalization layer then generate the utterance-level speaker embedding. The network is trained with a triplet loss function. In addition, squeeze-and-excitation modules (feature compression and excitation) are introduced so that the model attends to the relationships between channels, which improves its performance. Experiments were carried out on a rhesus macaque vocalization data set, and the effectiveness of the model for rhesus macaque voiceprint recognition was verified through comparison with other models, ablation experiments on the proposed model, and a comparative analysis of different training strategies. Compared with the traditional method and the unmodified Deep Speaker method, the method presented in this paper achieves higher voiceprint recognition accuracy and can be optimized more holistically as a single end-to-end model. |
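
The embedding and loss steps summarized above can be illustrated with a minimal sketch. It is not the implementation used in this work. It assumes mean pooling over frame-level features, cosine similarity as the comparison measure, and an arbitrary margin of 0.1; the function names are hypothetical and chosen only for illustration.

```python
import math


def mean_pool(frames):
    """Average frame-level feature vectors into one utterance-level vector.

    Assumption: simple mean pooling; the pooling layer used in the model may differ.
    """
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def length_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def embed(frames):
    """Utterance-level embedding: pooling followed by length normalization."""
    return length_normalize(mean_pool(frames))


def cosine_similarity(a, b):
    """For unit-length vectors the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss on cosine similarities.

    The loss is zero once the anchor is more similar to the positive
    (same individual) than to the negative (different individual) by at
    least `margin`. The margin value here is an assumed placeholder.
    """
    s_ap = cosine_similarity(anchor, positive)
    s_an = cosine_similarity(anchor, negative)
    return max(0.0, s_an - s_ap + margin)


# Toy frame-level features (hypothetical values, two dimensions per frame).
anchor = embed([[1.0, 0.0], [1.0, 0.0]])
positive = embed([[2.0, 0.0]])
negative = embed([[0.0, 3.0]])
loss = triplet_loss(anchor, positive, negative)
```

In this toy case the anchor and positive embeddings coincide after normalization, so the loss vanishes; swapping the positive and negative samples yields a positive loss that training would reduce.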