
Automatic Speech Recognition

Posted on: 2021-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: Muhib Ullah
Full Text: PDF
GTID: 2428330611966325
Subject: Electrical and Computer Engineering
Abstract/Summary:
The purpose of an Automatic Speech Recognition (ASR) system is to transcribe a continuous acoustic signal into text and extract linguistic information from the acoustic stream. Current ASR systems transcribe continuous speech with Word Error Rates (WER) ranging from roughly 10% down to 5.7%. For the last two decades, most of these systems have used Hidden Markov Models (HMMs) in conjunction with Gaussian Mixture Models (GMMs) to model the phonetic units corresponding to words: HMMs model the temporal variability of speech, while GMMs model the distribution of the acoustic input within each HMM state. HMMs provide an effective way of modelling time-varying spectral vector sequences. More recently, Deep Neural Networks (DNNs) have been used to determine how well each HMM state fits the acoustic input.

The acoustic coefficients of an utterance can be derived in one of two pre-processing domains: (a) spectral-based parameters and (b) dynamic time series. Spectral-based parameters are the most widely used domain, and Mel-Frequency Cepstral Coefficients (MFCCs) are the most common spectral-based approach in recognition tasks. We choose MFCCs computed from the spectrogram as the pre-processing scheme for uttered speech; although transcribing the raw utterance with a Recurrent Neural Network (RNN) or Restricted Boltzmann Machines (RBMs) is possible, the high computational cost can make performance worse.

Recurrent connections are applied to a hidden layer of the feed-forward network, allowing the model to capture temporal dependencies. Mini-batch gradient descent (MGD) is chosen as the training method, and the back-propagation through time (BPTT) algorithm is applied to make training the RNN acoustic model more efficient and effective. Finally, we address the vanishing-gradient and exploding-gradient problems of the basic RNN by presenting an advanced, gated RNN architecture: Long Short-Term Memory (LSTM).

The major contributions of this work are twofold. First, we present detailed research on a speech recognition system for Large-Vocabulary Continuous Speech, in which an Acoustic Model (AM) over context-dependent phones is combined with a pronunciation dictionary and a Language Model (LM). To make recognition faster, we explore multi-pass token-passing techniques, and a beam-search approach is used to increase computational efficiency by narrowing the search space. Second, we show that, instead of increasing the number of units in each RNN layer, introducing Long Short-Term Memory units gives better performance. Moreover, LSTM-based recurrent networks make efficient use of model parameters to remember long-term sequences, and LSTMs dominate standard RNNs on the long-range contextual dependencies previously considered out of reach.
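The gating mechanism by which LSTMs avoid the vanishing-gradient problem of the basic RNN can be sketched as a single scalar-valued cell step. This is a minimal illustration of the standard LSTM equations, not the thesis's actual implementation; the function name `lstm_step` and the toy weight layout are assumptions for clarity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step on scalar values.
    W maps each gate name to a (w_x, w_h, b) triple; toy weights for clarity."""
    pre = {g: W[g][0] * x + W[g][1] * h_prev + W[g][2]
           for g in ("i", "f", "o", "c")}
    i = sigmoid(pre["i"])    # input gate: how much new content to write
    f = sigmoid(pre["f"])    # forget gate: how much old cell state to keep
    o = sigmoid(pre["o"])    # output gate: how much cell state to expose
    g = math.tanh(pre["c"])  # candidate cell update
    c = f * c_prev + i * g   # additive update: gradients flow through f unchanged
    h = o * math.tanh(c)     # hidden state passed to the next time step
    return h, c
```

With the forget gate saturated open (f close to 1) and the input gate shut (i close to 0), the cell state is carried forward unchanged across arbitrarily many steps, which is exactly the long-term memory behaviour a plain tanh RNN lacks.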
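The beam-search pruning mentioned in the first contribution can be sketched as follows. This is a simplified, frame-synchronous version over toy per-frame distributions, assumed for illustration; the decoder in the thesis operates over a token-passing lattice rather than plain dictionaries.

```python
import math

def beam_search(frame_log_probs, beam_width):
    """Decode a sequence of per-frame log-probability tables.
    frame_log_probs: list of dicts mapping token -> log probability.
    Only the beam_width highest-scoring partial hypotheses survive each frame,
    which bounds the search space instead of exploring every path."""
    beams = [((), 0.0)]  # (token sequence, cumulative log probability)
    for table in frame_log_probs:
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in table.items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune: keep only the beam
    return beams[0]  # best surviving hypothesis and its score
```

Pruning trades exactness for speed: with a vocabulary of V tokens over T frames the full search space is V**T paths, while the beam explores at most beam_width * V candidates per frame.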
Keywords/Search Tags: ASR, Hidden Markov Model, MFCCs, Acoustic Model, RNN, LSTM, BPTT