
Deep Learning Based Spoken Language Identification

Posted on: 2016-02-04 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: B Jiang | Full Text: PDF
GTID: 1228330470957957 | Subject: Signal and Information Processing
Abstract/Summary:
Spoken language identification (LID) is the automatic process of determining the language identity of a given speech segment. With the rapid progress of communication technology, LID plays an increasingly important role in multilingual speech processing systems. LID techniques have advanced significantly over the past few decades, and their performance on long speech segments is comparable to, or even better than, that of humans. However, performance remains far from satisfactory, especially under short-duration test conditions and for highly confusable dialects. This is largely because language information is latent and depends heavily on statistics collected over the speech segment. The major challenge for an LID system is to design an effective representation of each utterance that is specific to language information, which is closely tied to the front-end feature extraction and back-end modeling techniques. Currently, the front-end features used in LID systems are spectral or phonetic, which is insufficient for the challenging short-duration and confusable-dialect conditions. This thesis investigates the use of deep learning to extract robust features and to improve model capability for language recognition.

First, motivated by the success of Deep Neural Networks (DNN) in speech recognition, we proposed using Deep Bottleneck Features (DBF) for spoken LID. DBFs are generated by a structured DNN containing a narrow internal bottleneck layer, trained with phonemes or phoneme states as targets. Since the bottleneck layer has far fewer hidden nodes than the other layers, DNN training forces its activations to form a low-dimensional, compact representation of the original inputs. DBFs can therefore be considered more robust to the variations caused by different speakers or channels, the specific content of the speech, and background noise.
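The bottleneck extraction described above can be sketched with a toy forward pass. The layer sizes, the 40-unit bottleneck width, and the 3000-senone output are illustrative assumptions, not values taken from the thesis; random weights stand in for a network trained with cross-entropy against phoneme-state targets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical architecture: 39-dim acoustic input, wide hidden layers,
# a narrow 40-unit bottleneck, and 3000 phoneme-state (senone) targets.
sizes = [39, 1024, 1024, 40, 1024, 3000]  # layer 3 is the bottleneck

# Random weights stand in for a trained DBF extractor.
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(sizes[:-1], sizes[1:])]

def extract_dbf(frames, weights, bottleneck_index=3):
    """Forward-propagate frames and return the bottleneck activations.

    `frames` is (n_frames, input_dim); the DBF is the hidden
    representation at `bottleneck_index` (input counts as layer 0).
    """
    h = frames
    for i, w in enumerate(weights, start=1):
        h = np.tanh(h @ w)     # squashing non-linearity in each hidden layer
        if i == bottleneck_index:
            return h           # 40-dim deep bottleneck feature per frame
    return h

frames = rng.standard_normal((100, 39))  # 100 frames of acoustic features
dbf = extract_dbf(frames, weights)
print(dbf.shape)                         # (100, 40)
```

The classification layers above the bottleneck are used only during training; at extraction time the forward pass stops at the bottleneck, which is what makes the representation compact.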
Experimental results demonstrated that an acoustic representation based on DBF, followed by total variability (TV, or ivector) modeling, significantly improves on state-of-the-art performance, especially for tasks with short utterances and highly confusable languages. In addition, we presented a parallel DBF-TV system that makes full use of several language-specific DBFs to further improve performance; each DBF is extracted from a DNN trained on the corpus of a specific language.

Second, we extended DBF by proposing a more discriminative deep bottleneck feature (D2BF) for the LID task. This is accomplished by tuning the DNN parameters, especially those of the bottleneck layer, in the trained DBF extractor using a discriminative criterion and an LID training corpus, which makes the D2BF more discriminative and task-aware. Specifically, the Maximum Mutual Information (MMI) criterion, with gradient descent, is applied to iteratively update the parameters of the bottleneck layer. The results show that D2BF is more appropriate and effective than DBF, especially without a score backend.

Third, we proposed an alternative to the state-of-the-art DBF-TV system in which the Gaussian mixture model (GMM) of the universal background model (UBM) is estimated in a supervised, discriminative way. A DNN replaces the unsupervised clustering process to compute the posterior probabilities of the classes, so that each Gaussian component of this discriminative GMM corresponds strictly to one of the output senones (phonemes or phoneme states). Once the GMM has been trained, TV modeling is performed in the conventional way. We also investigated using the DNN to compute the frame posteriors with respect to each of the classes in the TV model.

Finally, we investigated applying a DNN directly to the LID task, in which the DNN is used to predict the language class for a given frame of speech.
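The frame-level DNN classifier described above can be sketched as follows. The hidden width, the six-language target set, and the log-posterior averaging used to obtain an utterance-level score are illustrative assumptions; random weights stand in for a network trained with per-frame language labels.

```python
import numpy as np

rng = np.random.default_rng(1)

D, H, L = 40, 256, 6   # feature dim, hidden units, target languages (assumed)

# Random weights stand in for a DNN trained with frame-level language labels
# (every frame inherits the label of its utterance).
W1 = rng.standard_normal((D, H)) * 0.1
W2 = rng.standard_normal((H, L)) * 0.1

def frame_language_posteriors(frames):
    """Per-frame language posteriors from a small feed-forward DNN."""
    h = np.maximum(frames @ W1, 0.0)             # ReLU hidden layer
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)      # (n_frames, L), rows sum to 1

def utterance_score(frames):
    """Average frame log-posteriors to score the whole utterance."""
    p = frame_language_posteriors(frames)
    return np.log(p).mean(axis=0)                # (L,) per-language score

frames = rng.standard_normal((300, D))           # one test utterance
scores = utterance_score(frames)
hypothesis = int(scores.argmax())                # index of the top language
```

Averaging per-frame scores treats frames as independent, which is exactly the limitation that motivates the recurrent model discussed next.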
Since the entire speech waveform is considered to belong to a single class, we explored using a recurrent neural network (RNN) with long short-term memory (LSTM) structure to model the temporal sequence information. The results show that the RNN is well suited to the LID task and has a significant advantage over feed-forward neural networks.
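A minimal sketch of the recurrent approach, using the standard LSTM gate equations: the hidden size and six-language output are assumptions, random parameters stand in for a trained model, and the final hidden state is projected to a language posterior for the whole utterance.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, H, L = 40, 64, 6    # feature dim, hidden size, target languages (assumed)

# Random parameters stand in for a trained LSTM.
Wx = rng.standard_normal((D, 4 * H)) * 0.1   # input-to-gate weights
Wh = rng.standard_normal((H, 4 * H)) * 0.1   # recurrent weights
b  = np.zeros(4 * H)
Wo = rng.standard_normal((H, L)) * 0.1       # projection to language logits

def lstm_language_posterior(frames):
    """Run an LSTM over one utterance; score languages from the last state."""
    h = np.zeros(H)
    c = np.zeros(H)
    for x in frames:
        z = x @ Wx + h @ Wh + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input/forget/output gates
        c = f * c + i * np.tanh(g)                    # memory cell update
        h = o * np.tanh(c)                            # new hidden state
    logits = h @ Wo
    p = np.exp(logits - logits.max())
    return p / p.sum()                                # utterance-level posterior

frames = rng.standard_normal((150, D))                # one utterance
post = lstm_language_posterior(frames)
print(post.shape, np.isclose(post.sum(), 1.0))        # (6,) True
```

Because the memory cell carries state across frames, the utterance-level decision can exploit temporal context that a feed-forward, frame-by-frame classifier discards.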
Keywords/Search Tags: Language Identification, Total Variability, ivector, Deep Bottleneck Feature, Deep Learning, Feature Learning, Maximum Mutual Information, Deep Neural Network, Recurrent Neural Network