
Synthetic Speech Detection Using Multi-Domain Features

Posted on: 2024-02-21    Degree: Master    Type: Thesis
Country: China    Candidate: J Xu    Full Text: PDF
GTID: 2568307103475884    Subject: Information and Communication Engineering
Abstract/Summary:
The rapid progress of modern science and technology has enabled biometrics to be widely and deeply integrated into authentication across many fields. Automatic Speaker Verification (ASV), a critical component of biometrics, uses acoustic features to identify individuals and is widely deployed in home security, banking, and other domains. However, an ASV system that cannot differentiate genuine from spoofed speech is vulnerable to malicious spoofing attacks, so detecting spoofed speech is pivotal to securing ASV systems. A spoof speech detection system addresses this problem through feature extraction and pattern matching. Spoofed speech is currently produced mainly by imitation, replay, voice conversion, and speech synthesis. Synthetic speech is text-independent, and mature, readily available synthesis technology makes high-fidelity spoofed speech easy to produce, posing a serious threat to the security of ASV systems. Studying Synthetic Speech Detection (SSD) is therefore of great theoretical and practical value.

To address the poor robustness and weak generalization of acoustic features in current SSD research, this thesis explores acoustic features from three perspectives: the time, spatial, and frequency domains. The primary innovations of this study are as follows:

(1) To improve the detection accuracy and generalization of SSD, a novel approach based on Constant Q Modulation Envelope (CQME) features is proposed that exploits the time domain information of speech signals. Although synthetic speech can reproduce the approximate contour of the time domain envelope of genuine speech, its local details still differ markedly. The method applies the Constant Q Transform (CQT) to derive the time modulation envelope spectrum of a speech signal, extracts the CQME feature, and uses a Random Forest (RF) classification model to detect synthetic speech. Experiments on the ASVspoof 2019 dataset show significant improvements in both detection accuracy and generalization over traditional acoustic features.

(2) Traditional SSD methods typically extract acoustic features in the time and frequency domains while underutilizing texture information. Although the Local Binary Pattern (LBP) has been applied to synthetic speech detection, its high dimensionality and computational complexity remain major obstacles. To address these limitations, a novel SSD approach is proposed that exploits the spatial domain information of the spectrogram and extracts features with the Center-Symmetric Local Binary Pattern (CSLBP). First, the Short-Time Fourier Transform (STFT) is used to obtain the spectrogram. The CSLBP algorithm then extracts texture information from the spatial and quantitative relationships among the pixels, and a histogram of the texture codes yields a 64-dimensional feature vector used to train the RF model for synthetic speech detection. Experimental results show that this approach significantly reduces the feature size while improving detection accuracy.

(3) To further improve the texture-based SSD method, a novel approach is proposed that exploits the frequency domain information of the spectrogram by extracting Local Phase Quantization (LPQ) features. The spectrogram is first divided into multiple sub-blocks, and the LPQ algorithm is applied to each sub-block. The resulting LPQ vectors are then histogrammed to form the input feature for the random forest model, which performs the detection of synthetic speech. Experimental results show that this method not only reduces the tandem Detection Cost Function (t-DCF) of the SSD system but also exhibits stronger generalization ability.

(4) Traditional SSD methods that rely on a single feature extracted in the time, spatial, or frequency domain struggle to detect the full variety of synthetic spoofing attacks. To overcome this challenge, a Cross Domain Multi-Feature Fusion (CDMF²) method is presented that integrates information from all three domains: the CQME feature captures time domain information, the CSLBP feature spatial domain information, and the LPQ feature frequency domain information. These features are fused at the feature level to form the CDMF² fusion feature. Experimental results indicate that the CDMF² method significantly improves the detection accuracy and generalizability of the SSD system by compensating for the limitations of individual feature extraction methods. Moreover, it is robust under noisy conditions, making it a promising solution for detecting various types of synthetic spoofing attacks.
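To make the CQME idea concrete, the sketch below computes a simplified time modulation envelope feature. It substitutes a log-spaced FFT filterbank for a true Constant Q Transform and uses arbitrary illustrative parameters (band count, frame size, number of modulation bins); the function name and every setting are assumptions for illustration, not the thesis's implementation.

```python
import numpy as np

def modulation_envelope_feature(signal, n_bands=8, frame_len=512, hop=256, n_mod=16):
    """Simplified CQME-style feature: per-band temporal envelopes from a
    log-spaced filterbank (a stand-in for a true CQT), followed by an FFT
    of each envelope to capture its modulation content."""
    # Frame the signal and take magnitude spectra
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    spec = np.abs(np.array([
        np.fft.rfft(window * signal[i * hop:i * hop + frame_len])
        for i in range(n_frames)
    ]))                                        # shape: (n_frames, frame_len//2 + 1)

    # Log-spaced band edges approximate constant-Q frequency resolution
    edges = np.unique(np.round(
        np.geomspace(1, spec.shape[1] - 1, n_bands + 1)).astype(int))
    envelopes = np.array([spec[:, lo:hi].sum(axis=1)
                          for lo, hi in zip(edges[:-1], edges[1:])])

    # Modulation spectrum: FFT of each mean-removed band envelope, low bins only
    mod = np.abs(np.fft.rfft(envelopes - envelopes.mean(axis=1, keepdims=True),
                             axis=1))[:, :n_mod]
    return mod.flatten()

# Example: feature vector for one second of noise at 16 kHz
rng = np.random.default_rng(0)
feat = modulation_envelope_feature(rng.standard_normal(16000))
```

With these defaults the feature is 8 bands × 16 modulation bins = 128 dimensions; the thesis's actual CQME dimensionality is not stated in the abstract.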
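The CSLBP step can be sketched as follows. Classic LBP compares each pixel against its 8 neighbors (256 codes); CSLBP instead compares the 4 center-symmetric neighbor pairs, giving a 4-bit code and a 16-bin histogram. The 64-dimensional vector reported in the thesis presumably comes from additional block-wise or multi-scale concatenation not shown here; the function name and threshold are illustrative.

```python
import numpy as np

def cslbp_histogram(spectrogram, threshold=0.01):
    """Center-Symmetric LBP over a 2-D array: each interior pixel gets a
    4-bit code from its 4 center-symmetric neighbor pairs; return the
    normalized 16-bin histogram of codes."""
    s = np.asarray(spectrogram, dtype=float)
    # The 4 center-symmetric pairs around each interior pixel
    pairs = [
        (s[:-2, 1:-1], s[2:, 1:-1]),   # north vs south
        (s[:-2, 2:],   s[2:, :-2]),    # north-east vs south-west
        (s[1:-1, 2:],  s[1:-1, :-2]),  # east vs west
        (s[2:, 2:],    s[:-2, :-2]),   # south-east vs north-west
    ]
    codes = np.zeros((s.shape[0] - 2, s.shape[1] - 2), dtype=int)
    for bit, (a, b) in enumerate(pairs):
        codes |= ((a - b) > threshold).astype(int) << bit
    hist = np.bincount(codes.ravel(), minlength=16).astype(float)
    return hist / hist.sum()

# Example on a random stand-in spectrogram (frequency bins x frames)
spec = np.random.default_rng(2).random((64, 100))
hist = cslbp_histogram(spec)
```

The dimensionality reduction versus classic LBP (16 bins instead of 256 per region) is what makes the method cheap enough to histogram densely over the spectrogram.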
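A minimal LPQ sketch, under the standard formulation: evaluate a local short-term Fourier transform at four low frequency points per pixel, quantize the signs of the real and imaginary parts into an 8-bit code, and histogram the codes into 256 bins. This processes a single block; the thesis applies the procedure per sub-block of the spectrogram and concatenates. Window size and frequency choice are illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lpq_histogram(block, win=5):
    """Local Phase Quantization: local STFT at 4 frequency points per
    pixel, sign-quantize real/imag parts into an 8-bit code, return the
    normalized 256-bin code histogram."""
    x = np.arange(win) - win // 2
    a = 1.0 / win                      # lowest non-zero frequency for this window
    e = np.exp(-2j * np.pi * a * x)    # 1-D complex exponential basis
    ones = np.ones(win)
    kernels = [np.outer(e, ones),      # frequency (a, 0)
               np.outer(ones, e),      # frequency (0, a)
               np.outer(e, e),         # frequency (a, a)
               np.outer(e, e.conj())]  # frequency (a, -a)
    patches = sliding_window_view(block, (win, win))   # (H', W', win, win)
    codes = np.zeros(patches.shape[:2], dtype=int)
    bit = 0
    for k in kernels:
        resp = np.tensordot(patches, k, axes=([2, 3], [0, 1]))
        codes |= (resp.real >= 0).astype(int) << bit
        codes |= (resp.imag >= 0).astype(int) << (bit + 1)
        bit += 2
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# Example on one random spectrogram sub-block
lpq = lpq_histogram(np.random.default_rng(3).random((48, 48)))
```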
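Feature-level fusion as described for CDMF² amounts to concatenating the per-utterance descriptors from the three domains before classification. The toy sketch below shows that structure with random stand-in features and scikit-learn's `RandomForestClassifier` as the RF model; all shapes and data are fabricated for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fuse(cqme, cslbp, lpq):
    """Feature-level fusion: concatenate the time (CQME), spatial (CSLBP)
    and frequency (LPQ) domain descriptors into one CDMF^2 vector."""
    return np.concatenate([cqme, cslbp, lpq])

# Toy example: random stand-in features for 40 utterances
rng = np.random.default_rng(1)
X = np.array([fuse(rng.random(128), rng.random(16), rng.random(256))
              for _ in range(40)])
y = np.array([0, 1] * 20)              # 0 = genuine, 1 = synthetic
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = clf.predict(X)
```

Concatenation keeps the three descriptors independent up to the classifier, letting the random forest weigh whichever domain is discriminative for a given attack type.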
Keywords/Search Tags:Synthetic speech detection, Constant Q modulation envelope, Center symmetric local binary pattern, Local phase quantization, Feature fusion, Random Forest