| In recent years, with the rapid development of artificial intelligence (AI), a wide variety of intelligent applications and devices have appeared. They have changed some traditional ways of working and living, but AI systems themselves remain vulnerable. Spoofed images, spoofed audio, and adversarial example attacks all pose serious threats to AI. Trustworthy AI, a hot research topic, aims to make AI systems harder to attack through various defense and fault-tolerance techniques. This thesis focuses on attacks on speech and proposes corresponding fault-tolerant methods to improve the robustness of speech recognition systems. It considers three main attack methods: speech adversarial example attacks, replay attacks, and synthetic speech attacks. Firstly, for speech adversarial example attacks, this thesis proposes two fault-tolerance methods, one based on speech preprocessing and one based on robust model training. An adversarial example attack adds a noise-like perturbation to a sample so that it crosses the decision boundary and is classified into another class. Because the perturbation is essentially noise, noise-reduction preprocessing can effectively eliminate it, while robust training strengthens the model's decision boundary so that adversarial examples cannot easily cross it. The experimental results show that, on short commands, the recognition accuracy of the noise-reduction method reaches 81.85% and that of the robust-training method reaches 89.61%, a large improvement over the original model. On long speech, the recognition accuracy of the noise-reduction method reaches 76.79%. Secondly, for replay attacks, this thesis proposes a feature based on the short-time zero-crossing rate, called AZsil. Because of recording-equipment noise and environmental noise, recorded speech differs clearly from the original speech, especially in the silent
parts. This thesis uses that difference to detect abnormal silent segments in each utterance. The difference between noise and a genuinely silent part is reflected in the short-time zero-crossing rate: the zero-crossing rate of noise is higher than that of a silent part. Therefore, we sum the short-time zero-crossing rates of all silent parts and take the average, which gives the AZsil feature used to detect replay attacks. The experimental results show that the AZsil feature can effectively detect replayed recordings, with a detection accuracy of 94.61%. Next, this thesis proposes a segment-based feature extraction method for synthetic speech attacks. There are many kinds of synthetic speech attacks, so they are more difficult to detect than replay attacks. This thesis analyzes the characteristics of synthetic speech by means of waveform analysis and cluster analysis. The analysis shows that most synthetic speech has synthesis defects in the silent segments, because fewer features can be learned from silent segments; the cluster analysis further shows that training the model on shorter features gives better detection. In this thesis, the speech is divided into a silent part and a vocal part, from which the AZsil feature and the Word Constant Q Cepstrum Coefficient (WCQCC) feature are extracted, respectively. The AZsil feature is used as in replay detection, but it cannot detect all synthesized speech. Therefore, this thesis improves the CQCC feature to form the WCQCC feature, which concatenates the CQCC features of each word in the speech so that the features are more concentrated. In addition, we propose a biased decision strategy (BDS) that combines the decisions from the two features to make the final detection. The experimental results show that BDS raises the detection accuracy for synthetic speech to 94.77%. Finally, because a speech recognition system may be threatened by all three kinds of attacks at the same time, this
thesis proposes a series-parallel fault-tolerant scheme for speech attacks in real environments that provides comprehensive fault tolerance against the three kinds of attacks. The final results show that the proposed comprehensive fault-tolerant scheme can effectively detect fake speech, with a detection accuracy of 91.58%. |
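
The AZsil computation described above (average the short-time zero-crossing rate over the silent parts of an utterance) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say how silent parts are located, so the energy-threshold silence detector, the frame length, the hop size, and the function names `frame_signal`, `short_time_zcr`, and `azsil` are all assumptions made for illustration.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (any trailing partial frame is dropped)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])


def short_time_zcr(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def azsil(x, frame_len=400, hop=160, energy_ratio=0.1):
    """Average zero-crossing rate over frames judged silent.

    Assumption: a frame counts as silent when its mean energy is below
    energy_ratio times the mean frame energy of the whole utterance.
    The defaults (25 ms frames, 10 ms hop at 16 kHz) are hypothetical.
    """
    x = np.asarray(x, dtype=float)
    frames = frame_signal(x, frame_len, hop)
    energy = np.mean(frames ** 2, axis=1)
    silent = energy < energy_ratio * np.mean(energy)
    if not silent.any():
        return 0.0  # no silent frames found under this threshold
    return float(np.mean(short_time_zcr(frames)[silent]))
```

Under these assumptions, an utterance whose pauses contain recording or environmental noise should yield a higher value than one whose pauses are close to digital silence, which is the property the abstract relies on for replay detection. The decision threshold applied to the value would still have to be chosen from data.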