
Research On Language Identification Based On Temporal Feature Representation

Posted on: 2022-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Liu
Full Text: PDF
GTID: 2518306614960009
Subject: Automation Technology

Abstract/Summary:
Language identification (LID) is an important component of speech front-end processing and a key interface for future human-computer interaction. Its accuracy and efficiency strongly influence the development of intelligent systems, and it carries both scientific and practical value. Since temporal variability is an important basis for describing discriminative features, the accuracy and efficiency of the temporal modelling approach have a significant impact on language identification systems. However, existing temporal methods capture only the final hidden state of the sequence, inevitably discarding important temporal information carried by the intermediate hidden states. To address this problem, this work proposes several targeted methods that encode the temporal information of sequence features in the hidden layers of recurrent or convolutional networks more efficiently and accurately, in order to obtain higher-order dynamic discriminative features of the audio and reduce the error rate of language identification systems. Three methods are proposed, all taking sequence and temporal models as the entry point and improving the temporal extraction of sequence features through the feature pooling and encoding stages of a language identification system. First, an attention-based spatial representation is added to a traditional sequence model to enhance the encoded features; then a temporal pooling method based on empirical risk minimisation is explored; finally, the sequence model and this temporal pooling method are combined. The main contributions of this thesis are as follows.

(1) A language identification method based on a convolutional neural network–bidirectional long short-term memory network (CNN-BiLSTM) with multi-head attention pooling (MHAP). This method addresses the limited expressiveness of a single self-attention pooling layer: because different attention heads represent local patterns of different subsequences, the multi-head attention pooling layer can learn multiple attention representations of the sequence features. The method first trains a combination of a residual convolutional network (ResNet) and a BiLSTM as a variable-length front-end local feature extractor to encode the features; the multi-head attention pooling layer then decodes these features into a fixed-dimension utterance-level representation. Experimental results show that the MHAP-based pooling method reduces the error rate of the language identification system compared with other BiLSTM-based methods.

(2) A language identification method based on convolutional neural networks and a Temporal Pooling Unit (TPU), which addresses the limited temporal-dependence information captured by recurrent models such as Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTM). Such sequential models also introduce more parameters and therefore require more training data to reach the desired model precision. The method uses a residual network (ResNet50) with a Support Vector Regression (SVR) machine as the temporal pooling unit, which efficiently and accurately encodes the output sequences of the residual network to obtain higher-order discriminative dynamic features of the input audio. Experimental results show that, compared with other BiLSTM-based methods, the proposed method improves the performance of the language identification system.

(3) A language identification method based on CNN-BiLSTM and the temporal pooling unit, which enhances the regression characteristics of the sequences by applying temporal smoothing to the local feature sequences produced by the neural network. To address the problem that a sequence model considers only the temporal evolution at the end of the hidden states, losing part of the temporal dynamics, the method captures the global temporal dynamics of the accumulated states in the hidden layer of the recurrent model and preserves the dynamic trends of the accumulated features throughout the temporal evolution. These accumulated dynamics are as discriminative as the dynamic information of the original feature sequence. Experimental results show that the proposed method reduces the error rate of the language identification system compared with the BiLSTM method.
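The multi-head attention pooling idea of method (1) can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation; the head count, dimensions, and the simple dot-product scoring are illustrative assumptions. Each head scores every frame of a variable-length feature sequence, normalizes the scores over time, takes a weighted sum, and the head outputs are concatenated into one fixed-size utterance-level vector:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_pool(H, W):
    """Pool a variable-length sequence H into a fixed-size vector.

    H : (T, D) frame-level features (e.g. from a CNN-BiLSTM front end).
    W : (num_heads, D) attention parameters, one scoring vector per head
        (learnable in a real system; random here for illustration).
    """
    scores = H @ W.T                 # (T, num_heads): frame score per head
    alpha = softmax(scores, axis=0)  # normalize each head's scores over time
    pooled = alpha.T @ H             # (num_heads, D): one summary per head
    return pooled.reshape(-1)        # concatenate heads -> (num_heads * D,)

rng = np.random.default_rng(0)
T, D, heads = 120, 64, 4             # illustrative sizes
H = rng.standard_normal((T, D))
W = rng.standard_normal((heads, D))
u = multi_head_attention_pool(H, W)
print(u.shape)                       # (256,) = heads * D, independent of T
```

The output dimension depends only on the number of heads and the feature width, not on the utterance length, which is what lets a downstream classifier consume utterances of any duration.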
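The temporal pooling unit of methods (2) and (3) encodes a feature sequence by its regression trend over time. The thesis fits a Support Vector Regression machine; the sketch below substitutes ordinary least squares to stay dependency-free, so it is only an assumption-laden illustration of the idea: fit one line per feature dimension over a normalized time axis and keep the intercepts and slopes as a fixed-length summary of the sequence's dynamics:

```python
import numpy as np

def temporal_pool_regression(H):
    """Encode a feature sequence by its temporal regression coefficients.

    H : (T, D) frame-level features (e.g. ResNet50 outputs).
    Returns a (2 * D,) vector: D intercepts followed by D slopes.
    (Stand-in for the thesis's SVR-based temporal pooling unit.)
    """
    T, D = H.shape
    t = np.linspace(0.0, 1.0, T)
    X = np.column_stack([np.ones(T), t])            # (T, 2) design matrix
    coeffs, *_ = np.linalg.lstsq(X, H, rcond=None)  # (2, D) per-dim fits
    return coeffs.reshape(-1)                       # fixed-length code

# Toy sequence: every dimension drifts upward with slope 0.5 plus noise.
rng = np.random.default_rng(1)
H = 0.5 * np.linspace(0, 1, 100)[:, None] + 0.01 * rng.standard_normal((100, 3))
code = temporal_pool_regression(H)
print(code.shape)        # (6,): 3 intercepts, then 3 slopes
```

As with the attention sketch, the encoding size is independent of the sequence length, and the slope terms directly capture the dynamic trend that the final hidden state of a recurrent model would discard.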
Keywords/Search Tags: Language identification, Convolutional neural networks, Bidirectional long short-term memory networks, Temporal pooling