
Speech Anti-spoofing Based On Deep Learning

Posted on: 2024-02-29, Degree: Master, Type: Thesis
Country: China, Candidate: Y Zhou
GTID: 2568307103975709, Subject: Information and Communication Engineering
Abstract/Summary:
Among biometric recognition technologies, voice-based biometric authentication has become a popular choice for security systems because voice is convenient to acquire, highly discriminative, and well supported by capture equipment. However, voice spoofing attack technology has developed rapidly, and spoofed speech generated by the latest attack models is perceptually indistinguishable from genuine speech. This seriously undermines the security of voice-based biometric authentication systems and threatens the privacy of the public. Researchers have therefore proposed spoof speech detection as a way to find effective anti-spoofing methods and strengthen the security of speech systems. Although existing detection methods have made considerable progress, problems remain: they target a single acoustic environment, perform unsatisfactorily in real application scenarios, generalize poorly to unknown spoofing attacks, and rely on large, computationally complex models. This thesis combines deep learning with spoof speech detection to address these problems. The main contributions are as follows.

First, practical application scenarios involve complex acoustic interference such as background noise, reverberation, and channel interference, and a single class of acoustic feature provides too little information for the system to generalize well. To address both problems, this thesis proposes a multi-feature joint detection model based on Deep Residual Shrinkage Networks (DRSN). The model consists of Residual Shrinkage Building Units (RSBU) and a multi-feature joint detection unit. The adaptive threshold learning module in the RSBU uses a deep attention mechanism to monitor the audio state of each voice channel and then determines an independent interference threshold for that channel according to its acoustic environment, so professionals no longer need to monitor the environmental state manually. The subsequent soft-threshold module uses this independent threshold to remove interference-related information flexibly and autonomously, highlighting the highly discriminative information. The multi-feature joint detection unit selects suitable features, builds a single-feature DRSN detection sub-model for each, and fuses the sub-models with weights based on each sub-model's detection performance. This balances the information carried by the different features and lets them complement one another, improving the generalization capability of the system.

Second, extracting hand-crafted acoustic features causes irreversible information loss, and detection models with complex structures are inconvenient to deploy. To address both problems, this thesis explores an end-to-end detection structure and proposes the Raw Cross-dimension Interaction Attention Network (Raw CIANet), which takes the raw audio waveform rather than hand-crafted acoustic features as network input, avoiding complex feature extraction and unnecessary information loss. The network first investigates data augmentation for the raw waveform: Random Channel Masking Augmentation (RCMA) mines the implicit correlation between the time and frequency dimensions of the raw waveform and strengthens the highly discriminative information in both domains. The network then uses a lightweight cross-dimensional interaction attention module to capture time-domain and frequency-domain information while exploring more interaction-dependent information across the two domains, so that spoofing cues in different frequency subbands and time periods are extracted as fully as possible at low computational cost. Finally, this thesis explores two model-level attention fusion approaches to aggregate information efficiently.

Third, to make the spoofing detection models convenient to use, this thesis designs an integrated spoof speech detection system. The front-end interactive page simplifies user operation, and the back-end detection model combines the performance advantages of the multi-feature joint detection model and the cross-dimensional interaction attention detection network. Experiments verify that the system has good detection capability.
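The per-channel adaptive soft thresholding that the abstract attributes to the RSBU can be illustrated with a minimal NumPy sketch. It follows the commonly described channel-wise DRSN formulation, in which each channel's threshold is its mean absolute activation scaled by a learned factor in (0, 1). The abstract does not give the thesis's exact architecture, so the function names, the two-layer gating network, and the random weights `w1` and `w2` below are assumptions for illustration only.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink x toward zero by tau; entries with |x| <= tau become 0."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def channel_thresholds(x, w1, w2):
    """Compute one threshold per channel for x of shape (channels, time).

    Hypothetical gating network: mean |x| per channel -> linear + ReLU
    -> linear + sigmoid. The sigmoid output scales the channel's mean
    absolute value, so each threshold stays between 0 and that mean.
    """
    gap = np.abs(x).mean(axis=1)                   # (channels,)
    hidden = np.maximum(w1 @ gap, 0.0)             # ReLU
    alpha = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid scaling factor
    return (alpha * gap)[:, None]                  # (channels, 1)

# Toy feature map and untrained (random) weights, for illustration only.
rng = np.random.default_rng(0)
channels, frames = 4, 16
features = rng.normal(size=(channels, frames))
w1 = rng.normal(size=(channels, channels))
w2 = rng.normal(size=(channels, channels))

tau = channel_thresholds(features, w1, w2)
denoised = soft_threshold(features, tau)
```

In a trained network the weights would be learned jointly with the classifier, so channels dominated by interference can receive larger thresholds; here they are random and the sketch only shows the data flow.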
Keywords/Search Tags: spoof speech detection, deep learning, complex acoustic environments, attention mechanism