The evolution of artificial intelligence technology has brought numerous conveniences to daily life. In particular, automatic speech recognition (ASR) systems based on deep learning have greatly enriched human-machine interaction: their swift and accurate transcription enables users to input information anytime and anywhere. However, malicious exploitation of ASR also brings security risks. Large-scale eavesdroppers can, without users' awareness, use ASR to transcribe massive amounts of speech into easily searchable text, mine it for content of interest, and identify users, posing a risk of privacy leakage. Traditional defense mechanisms such as authentication and encryption can, to some extent, lower the risk of information leakage while speech is transmitted over the Internet. However, because voice files have low information density, encrypting them consumes substantial computing resources at the terminal. Moreover, although authenticated and encrypted transmission protects voice data on the channel, it cannot defend against curious service providers who use ASR to pry into speech content, leaking users' identity or behavioral privacy. To address this problem, this dissertation studies universal and stealthy adversarial-example-generation techniques for speech recognition: users proactively add adversarial perturbations to speech before uploading it to the Internet, making it difficult for ASR systems to recognize such voices effectively and thereby improving the security of using and transmitting speech information. The main research contents are as follows:

1. A universal adversarial example generation technique based on model feature pollution is proposed. To address the efficiency problem of quickly interfering with multiple speech inputs, model feature pollution is studied to generate perturbations that disturb the transcription results of many speech inputs while further reducing the data dependency of multi-data iteration methods. Optimization goals are set on the output matrices of each network layer in the ASR model, and a measure of the degree of confusion between the output matrices of different layers is proposed, so that, starting from a random perturbation sequence, the optimization drives the adversarial perturbation toward maximizing the interference between layers and yields a universal adversarial perturbation. On this basis, layer-measurement strategies are determined according to the temporal characteristics of speech recognition neural networks, and perturbation-placement strategies are determined according to the variable length of speech data, further strengthening the universal perturbation's disturbance of the original transcription text. Experiments on multiple models verify the effectiveness of the proposed method, achieving a word error rate of up to 40%; the method also exhibits a certain level of transferability across other models.

2. A perceptually concealed adversarial example generation technique based on noise space constraints is proposed. To address the difficulty of reducing the noise level of universal perturbations, noise space constraints are studied, and a frame-structure-based distribution rule for adversarial examples is proposed that shrinks the range of the perturbation at the cost of only a small increase in its magnitude, thereby converting background noise into isolated noise points. First, a perturbation analysis of the frame segmentation and windowing structures in feature extraction is carried out. Then, a method for measuring the adversarial example space, with evaluation based on iterative attenuation of the perturbation level, is designed, and the effect of composite factors on the adversarial example space is explored. Finally, cross-experiments on noise space constraints are designed according to the frame type used in ASR, verifying that the adversarial example space of models with a frame-synchronous structure is mainly influenced by a coupling effect: when generating adversarial examples for specific carriers, the coupling effect should be avoided, whereas when generating universal adversarial examples for non-specific carriers, it should be exploited, restricting noise levels to within 60%, 40%, and 33% ranges, respectively, at the cost of different degrees of increase in magnitude. This provides a new perspective for generating high-quality speech adversarial examples.

3. A detection-concealed adversarial example generation technique based on confidence enhancement is proposed. To address the problem that adversarial examples are easily caught by detection methods based on statistical features in neural networks, confidence enhancement is studied to improve the similarity between the logits of adversarial examples and those of normal speech. First, the difference between the logits distributions of adversarial examples and normal speech in the ASR model is analyzed, and the three-dimensional logits matrix is quantitatively characterized to exploit this difference fully. On this basis, it is shown that the statistical abnormality of an individual example's logits can be compensated by adding an extra loss function to the optimization process, and a confidence enhancement algorithm is designed that generates speech adversarial examples whose logits distributions resemble those of normal speech, thereby bypassing such detection methods. Experimental results on the dataset show that these adversarial examples can be detected with a success rate of only about 60%, demonstrating that the statistical feature differences in logits between adversarial examples and normal speech can be compensated.
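The confidence-enhancement idea in contribution 3 can be illustrated as a combined objective: one term disrupts the transcription, while a second term penalizes deviation of the adversarial logits statistics from those of normal speech. The following is a minimal numpy sketch under simplifying assumptions; the function names, the softmax-based surrogate attack term, the per-class mean/std statistics, and the weight LAMBDA are all illustrative, not the dissertation's actual loss.

```python
import numpy as np

LAMBDA = 0.5  # weight of the logits-similarity penalty (assumed, illustrative)

def logits_stats(logits):
    """Summarize a (frames x classes) logits matrix by per-class mean and std."""
    return np.concatenate([logits.mean(axis=0), logits.std(axis=0)])

def attack_loss(logits, orig_ids):
    """Toy surrogate for the transcription-disruption term: mean log-probability
    of the original labels. Minimizing it pushes the model away from the
    original transcription."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return np.log(probs[np.arange(len(orig_ids)), orig_ids] + 1e-9).mean()

def confidence_enhanced_loss(adv_logits, orig_ids, normal_stats):
    """Combined objective: disrupt transcription while keeping the adversarial
    logits statistically close to normal speech, so that detectors relying on
    logits statistics are harder to trigger."""
    detect_penalty = np.linalg.norm(logits_stats(adv_logits) - normal_stats)
    return attack_loss(adv_logits, orig_ids) + LAMBDA * detect_penalty

# Toy usage: random "logits" for 5 frames over 4 classes.
rng = np.random.default_rng(0)
adv = rng.normal(size=(5, 4))       # logits of a candidate adversarial example
normal = rng.normal(size=(5, 4))    # logits of a normal-speech reference
normal_ref = logits_stats(normal)
loss = confidence_enhanced_loss(adv, np.array([0, 1, 2, 3, 0]), normal_ref)
print(f"combined loss: {loss:.3f}")
```

In an actual attack, this scalar would be minimized with respect to the audio perturbation via gradient descent through the ASR model, trading off disruption strength against statistical stealth through the weight on the penalty term.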