
Research On Dialect Identification Based On Deep Learning

Posted on: 2024-02-01    Degree: Master    Type: Thesis
Country: China    Candidate: Q B Luo    Full Text: PDF
GTID: 2568307076997779    Subject: Robotics Engineering (Professional Degree)
Abstract/Summary:
Language identification is a technique that determines the language of a speech signal from its acoustic features, such as frequency, amplitude, and pitch. It has important applications in many fields, including speech-recognition front ends, cultural preservation, and national security. Methods for language identification fall mainly into traditional model-based approaches and deep learning-based approaches, and deep learning-based methods have recently made significant progress in this area. Dialect identification is a special case of language identification that distinguishes different dialects within the same language family. Because the differences between dialects are often small, dialect identification is more challenging than language identification. Compared with language identification, dialect identification has received relatively little research attention, and existing methods perform poorly in low-resource scenarios. This paper explores methods to improve the performance of dialect identification. The main work and contributions are as follows:

(1) Two novel Time Delay Neural Network (TDNN) architectures are proposed: the Dynamic Kernel Channel Attention TDNN (DKCA-TDNN) and the Multi-Scale Channel Adaptive TDNN (MSCA-TDNN). A TDNN can process multiple frames of speech features simultaneously and capture contextual information, but traditional TDNNs focus mainly on temporal information and ignore the importance of channel information. This work unifies the optimization of temporal and channel information extraction at the network level. DKCA-TDNN adopts dynamic kernel convolution and local channel attention modules: dynamic kernel convolution adaptively extracts the appropriate contextual information, while local channel attention establishes finer-grained channel dependencies. MSCA-TDNN introduces multi-scale convolution and multi-scale channel attention modules: multi-scale convolution provides receptive fields of different sizes, while multi-scale channel attention re-weights the extracted multi-scale channel features to obtain the key discriminative features. The two models were evaluated on the Arabic Dialect Identification (ADI17) dataset. A balanced fine-tuning strategy was proposed to address the data imbalance in ADI17, and a Z-score normalization method was proposed to eliminate score-distribution differences between dialects and improve recognition performance. Finally, score fusion of the two TDNN models achieved an average cost (Cavg) of 3.36% and an accuracy of 95.20%. Compared with the best results reported in the literature, the Cavg of the proposed system is relatively reduced by 22%, demonstrating the effectiveness of TDNNs that exploit multi-scale and channel information for identifying easily confused Arabic dialects.

(2) The application of the wav2vec 2.0 pre-trained model to dialect identification was explored. Current mainstream dialect identification systems use end-to-end neural network models that depend heavily on labeled training data; however, collecting labeled dialect datasets is expensive, and scarce labeled data severely degrades model performance. In this work, fine-tuning a cross-lingual pre-trained model on the target dialects achieved good recognition results. In addition, a multi-scale aggregation graph neural network was proposed as the back end during fine-tuning to implicitly exploit phoneme-sequence information. The system was evaluated on the dialect identification task of the 2020 Oriental Language Recognition (AP20-OLR) Challenge. Experimental results show that the proposed system significantly outperforms existing state-of-the-art systems, with a relative reduction of 50% in Cavg; the effectiveness of the proposed back-end network was also validated, with a relative reduction of 54% in Cavg. This study highlights the importance of integrating an effective back-end network when fine-tuning wav2vec 2.0 pre-trained models for low-resource dialect identification.
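The abstract describes TDNN blocks that combine temporal context with channel attention. The DKCA-TDNN and MSCA-TDNN modules themselves are not specified here, so the following is only a minimal sketch of the general idea, assuming a dilated 1-D convolution for temporal context and a squeeze-and-excitation style module for channel re-weighting; all class and parameter names are illustrative.

```python
# Minimal sketch: a TDNN layer with channel attention (PyTorch).
# This is an assumption-based illustration, not the thesis' actual
# DKCA-TDNN / MSCA-TDNN modules.
import torch
import torch.nn as nn


class ChannelAttentionTDNNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, kernel_size=3, dilation=1, reduction=8):
        super().__init__()
        # TDNN layer = 1-D convolution over frames with a dilation (context) step.
        self.tdnn = nn.Sequential(
            nn.Conv1d(in_dim, out_dim, kernel_size, dilation=dilation,
                      padding=dilation * (kernel_size - 1) // 2),
            nn.ReLU(),
            nn.BatchNorm1d(out_dim),
        )
        # Squeeze-and-excitation style channel attention: pool over time,
        # then predict a per-channel weight in (0, 1).
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(out_dim, out_dim // reduction, 1),
            nn.ReLU(),
            nn.Conv1d(out_dim // reduction, out_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):            # x: (batch, feat_dim, frames)
        h = self.tdnn(x)             # temporal context from the dilated convolution
        w = self.attention(h)        # per-channel weights, shape (batch, out_dim, 1)
        return h * w                 # channel-recalibrated features


if __name__ == "__main__":
    feats = torch.randn(4, 80, 200)                        # 80-dim filterbanks, 200 frames
    layer = ChannelAttentionTDNNLayer(80, 512, dilation=2)
    print(layer(feats).shape)                              # torch.Size([4, 512, 200])
```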
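The abstract also mentions a Z-score normalization of scores to remove per-dialect score-distribution differences before fusion. The exact recipe is not given, so the sketch below makes the common assumption that each dialect's score column is standardized with statistics estimated on a held-out development set; function names and dimensions are illustrative.

```python
# Hedged sketch: per-dialect Z-score normalization of classifier scores (NumPy).
import numpy as np


def fit_zscore_stats(dev_scores):
    """dev_scores: (num_utts, num_dialects) raw scores on a development set."""
    mean = dev_scores.mean(axis=0)
    std = dev_scores.std(axis=0) + 1e-8   # avoid division by zero
    return mean, std


def apply_zscore(test_scores, mean, std):
    """Standardize each dialect's score column, then decide by argmax."""
    normalized = (test_scores - mean) / std
    predictions = normalized.argmax(axis=1)
    return normalized, predictions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dev = rng.normal(size=(1000, 17))     # e.g. 17 Arabic dialects as in ADI17
    test = rng.normal(size=(10, 17))
    mean, std = fit_zscore_stats(dev)
    _, preds = apply_zscore(test, mean, std)
    print(preds)
```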
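For the second contribution, the abstract describes fine-tuning a cross-lingual wav2vec 2.0 model with a multi-scale aggregation graph neural network back end. That back end is not reproduced here; the sketch below only shows the general fine-tuning setup, assuming the publicly available XLSR-53 checkpoint and substituting a simple mean-pooling plus linear classifier for the proposed graph back end. The checkpoint name, head, and dialect count are assumptions for illustration.

```python
# Hedged sketch: wav2vec 2.0 encoder + simple classification head for
# dialect identification (PyTorch + Hugging Face transformers).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class Wav2Vec2DialectClassifier(nn.Module):
    def __init__(self, num_dialects, checkpoint="facebook/wav2vec2-large-xlsr-53"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        # Placeholder back end: the thesis proposes a multi-scale aggregation
        # graph neural network here instead of a plain linear head.
        self.head = nn.Linear(hidden, num_dialects)

    def forward(self, waveforms):                              # (batch, samples) at 16 kHz
        hidden = self.encoder(waveforms).last_hidden_state     # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)                            # utterance-level embedding
        return self.head(pooled)                               # dialect logits


if __name__ == "__main__":
    model = Wav2Vec2DialectClassifier(num_dialects=3)   # illustrative number of target dialects
    audio = torch.randn(2, 16000)                        # two 1-second dummy clips
    print(model(audio).shape)                            # torch.Size([2, 3])
```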
Keywords/Search Tags: dialect identification, deep learning, time-delay neural network, self-supervised learning, graph neural network