
Synthetic Speech Detection Using Multi-Domain Features

Posted on: 2024-02-21    Degree: Master    Type: Thesis
Country: China    Candidate: J Xu    Full Text: PDF
GTID: 2568307103475884    Subject: Information and Communication Engineering
Abstract/Summary:
The rapid progress of modern science and technology has enabled biometrics to be widely and deeply integrated into authentication across many fields. Automatic Speaker Verification (ASV), a critical component of biometrics, uses acoustic features to identify individuals and is widely deployed in home security, banking, and other domains. However, an ASV system that cannot differentiate genuine from spoofed speech is vulnerable to malicious spoofing attacks, so detecting spoofed speech is pivotal to securing ASV systems. A spoof speech detection system addresses this problem through feature extraction and pattern matching. Spoofed speech is currently produced mainly by imitation, replay, voice conversion, and speech synthesis. Synthetic speech is text-independent, and mature, readily available synthesis technology makes high-fidelity spoofed speech easy to produce, posing a serious threat to the security of ASV systems. Studying Synthetic Speech Detection (SSD) is therefore of great theoretical and practical value.

To address the poor robustness and weak generalization of acoustic features in current SSD research, this thesis explores acoustic features from three perspectives: the time, spatial, and frequency domains. The primary innovations of this study are as follows:

(1) To improve the detection accuracy and generalization of SSD, a novel approach based on Constant Q Modulation Envelope (CQME) features is proposed that exploits the time domain information of speech signals. Although synthetic speech can reproduce the approximate contour of the time domain envelope of genuine speech, its local details still differ markedly. The method applies the Constant Q Transform (CQT) to derive the time modulation envelope spectrum of a speech signal, extracts the CQME feature, and uses a Random Forest (RF) classification model to detect synthetic speech. Experiments on the ASVspoof 2019 dataset show significant improvements in both detection accuracy and generalization over traditional acoustic features.

(2) Traditional SSD methods typically extract acoustic features in the time and frequency domains while underutilizing texture information. Although the Local Binary Pattern (LBP) has been applied to synthetic speech detection, its high dimensionality and computational complexity remain major obstacles. To address these limitations, a novel SSD approach is proposed that exploits the spatial domain information of the spectrogram and extracts features with the Center-Symmetric Local Binary Pattern (CSLBP). First, the Short-Time Fourier Transform (STFT) is used to obtain the spectrogram. The CSLBP algorithm then extracts texture information from the spatial and quantitative relationships among the pixels, and a histogram of the texture codes yields a 64-dimensional feature vector used to train the RF model for synthetic speech detection. Experimental results show that this approach significantly reduces the feature size while improving detection accuracy.

(3) To further improve the texture-based SSD method, a novel approach is proposed that exploits the frequency domain information of the spectrogram by extracting Local Phase Quantization (LPQ) features. The spectrogram is first divided into multiple sub-blocks, and the LPQ algorithm is applied to each sub-block. The resulting LPQ vectors are then histogrammed to form the input feature for the random forest model, which performs the detection of synthetic speech. Experimental results show that this method not only reduces the tandem Detection Cost Function (t-DCF) of the SSD system but also exhibits stronger generalization ability.

(4) Traditional SSD methods that rely on a single feature extracted in the time, spatial, or frequency domain struggle to detect the full variety of synthetic spoofing attacks. To overcome this challenge, a Cross Domain Multi-Feature Fusion (CDMF²) method is presented that integrates information from all three domains: the CQME feature captures time domain information, the CSLBP feature spatial domain information, and the LPQ feature frequency domain information. These features are fused at the feature level to form the CDMF² fusion feature. Experimental results indicate that the CDMF² method significantly improves the detection accuracy and generalizability of the SSD system by compensating for the limitations of individual feature extraction methods. Moreover, it is robust under noisy conditions, making it a promising solution for detecting various types of synthetic spoofing attacks.
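To make the CQME idea concrete, the sketch below computes a simplified time modulation envelope feature. It substitutes a log-spaced FFT filterbank for a true Constant Q Transform and uses arbitrary illustrative parameters (band count, frame size, number of modulation bins); the function name and every setting are assumptions for illustration, not the thesis's implementation.

```python
import numpy as np

def modulation_envelope_feature(signal, n_bands=8, frame_len=512, hop=256, n_mod=16):
    """Simplified CQME-style feature: per-band temporal envelopes from a
    log-spaced filterbank (a stand-in for a true CQT), followed by an FFT
    of each envelope to capture its modulation content."""
    # Frame the signal and take magnitude spectra
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    spec = np.abs(np.array([
        np.fft.rfft(window * signal[i * hop:i * hop + frame_len])
        for i in range(n_frames)
    ]))                                        # shape: (n_frames, frame_len//2 + 1)

    # Log-spaced band edges approximate constant-Q frequency resolution
    edges = np.unique(np.round(
        np.geomspace(1, spec.shape[1] - 1, n_bands + 1)).astype(int))
    envelopes = np.array([spec[:, lo:hi].sum(axis=1)
                          for lo, hi in zip(edges[:-1], edges[1:])])

    # Modulation spectrum: FFT of each mean-removed band envelope, low bins only
    mod = np.abs(np.fft.rfft(envelopes - envelopes.mean(axis=1, keepdims=True),
                             axis=1))[:, :n_mod]
    return mod.flatten()

# Example: feature vector for one second of noise at 16 kHz
rng = np.random.default_rng(0)
feat = modulation_envelope_feature(rng.standard_normal(16000))
```

With these defaults the feature is 8 bands × 16 modulation bins = 128 dimensions; the thesis's actual CQME dimensionality is not stated in the abstract.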
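The CSLBP step can be sketched as follows. Classic LBP compares each pixel against its 8 neighbors (256 codes); CSLBP instead compares the 4 center-symmetric neighbor pairs, giving a 4-bit code and a 16-bin histogram. The 64-dimensional vector reported in the thesis presumably comes from additional block-wise or multi-scale concatenation not shown here; the function name and threshold are illustrative.

```python
import numpy as np

def cslbp_histogram(spectrogram, threshold=0.01):
    """Center-Symmetric LBP over a 2-D array: each interior pixel gets a
    4-bit code from its 4 center-symmetric neighbor pairs; return the
    normalized 16-bin histogram of codes."""
    s = np.asarray(spectrogram, dtype=float)
    # The 4 center-symmetric pairs around each interior pixel
    pairs = [
        (s[:-2, 1:-1], s[2:, 1:-1]),   # north vs south
        (s[:-2, 2:],   s[2:, :-2]),    # north-east vs south-west
        (s[1:-1, 2:],  s[1:-1, :-2]),  # east vs west
        (s[2:, 2:],    s[:-2, :-2]),   # south-east vs north-west
    ]
    codes = np.zeros((s.shape[0] - 2, s.shape[1] - 2), dtype=int)
    for bit, (a, b) in enumerate(pairs):
        codes |= ((a - b) > threshold).astype(int) << bit
    hist = np.bincount(codes.ravel(), minlength=16).astype(float)
    return hist / hist.sum()

# Example on a random stand-in spectrogram (frequency bins x frames)
spec = np.random.default_rng(2).random((64, 100))
hist = cslbp_histogram(spec)
```

The dimensionality reduction versus classic LBP (16 bins instead of 256 per region) is what makes the method cheap enough to histogram densely over the spectrogram.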
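A minimal LPQ sketch, under the standard formulation: evaluate a local short-term Fourier transform at four low frequency points per pixel, quantize the signs of the real and imaginary parts into an 8-bit code, and histogram the codes into 256 bins. This processes a single block; the thesis applies the procedure per sub-block of the spectrogram and concatenates. Window size and frequency choice are illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lpq_histogram(block, win=5):
    """Local Phase Quantization: local STFT at 4 frequency points per
    pixel, sign-quantize real/imag parts into an 8-bit code, return the
    normalized 256-bin code histogram."""
    x = np.arange(win) - win // 2
    a = 1.0 / win                      # lowest non-zero frequency for this window
    e = np.exp(-2j * np.pi * a * x)    # 1-D complex exponential basis
    ones = np.ones(win)
    kernels = [np.outer(e, ones),      # frequency (a, 0)
               np.outer(ones, e),      # frequency (0, a)
               np.outer(e, e),         # frequency (a, a)
               np.outer(e, e.conj())]  # frequency (a, -a)
    patches = sliding_window_view(block, (win, win))   # (H', W', win, win)
    codes = np.zeros(patches.shape[:2], dtype=int)
    bit = 0
    for k in kernels:
        resp = np.tensordot(patches, k, axes=([2, 3], [0, 1]))
        codes |= (resp.real >= 0).astype(int) << bit
        codes |= (resp.imag >= 0).astype(int) << (bit + 1)
        bit += 2
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# Example on one random spectrogram sub-block
lpq = lpq_histogram(np.random.default_rng(3).random((48, 48)))
```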
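Feature-level fusion as described for CDMF² amounts to concatenating the per-utterance descriptors from the three domains before classification. The toy sketch below shows that structure with random stand-in features and scikit-learn's `RandomForestClassifier` as the RF model; all shapes and data are fabricated for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fuse(cqme, cslbp, lpq):
    """Feature-level fusion: concatenate the time (CQME), spatial (CSLBP)
    and frequency (LPQ) domain descriptors into one CDMF^2 vector."""
    return np.concatenate([cqme, cslbp, lpq])

# Toy example: random stand-in features for 40 utterances
rng = np.random.default_rng(1)
X = np.array([fuse(rng.random(128), rng.random(16), rng.random(256))
              for _ in range(40)])
y = np.array([0, 1] * 20)              # 0 = genuine, 1 = synthetic
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = clf.predict(X)
```

Concatenation keeps the three descriptors independent up to the classifier, letting the random forest weigh whichever domain is discriminative for a given attack type.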
Keywords/Search Tags:Synthetic speech detection, Constant Q modulation envelope, Center symmetric local binary pattern, Local phase quantization, Feature fusion, Random Forest