
Segmental models with an exploration of acoustic and lexical grouping in automatic speech recognition

Posted on: 2016-01-26
Degree: Ph.D
Type: Dissertation
University: The Ohio State University
Candidate: He, Yanzhang
Full Text: PDF
GTID: 1478390017475879
Subject: Artificial Intelligence
Abstract/Summary:
Developing automatic speech recognition (ASR) technologies is of significant importance for facilitating human-machine interaction. The main challenges in ASR development center on designing appropriate statistical models and the modeling targets associated with them. These challenges exist throughout the ASR probabilistic transduction pipeline, which aggregates information from the bottom up: in acoustic modeling, hidden Markov models (HMMs) map the observed speech signal to phoneme targets frame by frame, suffering from the well-known frame conditional independence assumption and an inability to integrate long-span features; in lexical modeling, phonemes are grouped into a sequence of vocabulary words forming a meaningful sentence, where out-of-vocabulary (OOV) words cannot easily be accounted for. The main goal of this dissertation is to apply segmental models, a family of structured prediction models for sequence segmentation and labeling, to tackle these problems by introducing innovative intermediate-level structures into the ASR pipeline via acoustic and lexical grouping.

On the acoustic side, we explore discriminative segmental models that overcome some of the limitations of frame-level HMMs by modeling phonemes as segmental targets of variable length. In particular, we introduce a new type of acoustic model that combines segmental conditional random fields (SCRFs) with deep neural networks (DNNs). In light of recent successful applications of SCRFs to lattice rescoring, we put forward a novel approach to first-pass word recognition that uses SCRFs directly as acoustic models. With the proposed model, we are able to integrate local discriminative classifiers, segmental long-span dependencies such as duration, and subword unit transitions as features in a unified framework during recognition.
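The segmental decoding described above, assigning labels to variable-length segments rather than to individual frames, can be illustrated with a semi-Markov (segmental) Viterbi search. This is a minimal sketch, not code from the dissertation: the feature functions (pooled frame scores, a duration score, a label-transition score) and all names are illustrative stand-ins for the SCRF features discussed above.

```python
import math

def segmental_viterbi(frame_scores, labels, max_dur, dur_score, trans_score):
    """Semi-Markov (segmental) Viterbi: jointly find the best segmentation
    of T frames and a label for each segment.  frame_scores[t][y] is a
    local classifier score (e.g. a DNN log-posterior) for label y at frame t;
    dur_score and trans_score stand in for segmental duration and
    label-transition features."""
    T = len(frame_scores)
    # best[t][y]: score of the best segmentation of frames [0, t) ending in label y
    best = [{} for _ in range(T + 1)]
    back = [{} for _ in range(T + 1)]
    best[0] = {None: 0.0}
    for t in range(1, T + 1):
        for y in labels:
            for d in range(1, min(max_dur, t) + 1):
                s = t - d
                # segment score: pooled frame scores plus a duration feature
                seg = sum(frame_scores[i][y] for i in range(s, t)) + dur_score(y, d)
                for prev, sc in best[s].items():
                    cand = sc + seg + trans_score(prev, y)
                    if cand > best[t].get(y, -math.inf):
                        best[t][y] = cand
                        back[t][y] = (s, prev)
    # trace back from the best final label
    y = max(best[T], key=best[T].get)
    t, segs = T, []
    while t > 0:
        s, prev = back[t][y]
        segs.append((s, t, y))
        t, y = s, prev
    return list(reversed(segs))
```

With zero-cost transitions and a small per-segment penalty in the duration score, the search prefers a few long segments whose labels agree with the local classifier scores; note the inner loop over all durations, which is exactly the inference cost that motivates more efficient factorizations.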
To facilitate training and decoding, we introduce "Boundary-Factored SCRFs", a special case of SCRFs with an efficient inference algorithm. Furthermore, we introduce a WFST-based decoding framework that enables SCRF acoustic models to be combined with language models for direct word recognition. We empirically verify the superiority of the proposed model over frame-level CRFs and hybrid HMM-DNN systems using the same label space.

On the lexical side, morphs, as the smallest linguistically meaningful subword units, provide a better balance between lexical confusability and OOV coverage than phonemes when used in recognition to recover OOV words. In this dissertation, we study the use of Morfessor, an unsupervised HMM segmental model specialized for morphological segmentation, to derive morphs suitable for handling OOV words in ASR and keyword spotting. We demonstrate that decoding with the automatically derived morphs is effective for a morphologically rich language in a low-resource setting. However, grapheme-based morphs do not work well for some of the other languages we evaluate, due in part to over-segmentation and incorrect pronunciations. We then develop several novel types of morphs based on phonetic representations to better account for pronunciations and confusability. Morfessor is shown to be able to learn the phonetic regularities underlying the proposed morphs, achieving improved performance in the languages where traditional grapheme-based morphs tend to fail. Finally, we show that the phonetically-based morphs are complementary to the grapheme-based morphs across all languages, allowing for substantial performance improvements via system combination.
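As an illustration of the lexical grouping step, a Morfessor-style unigram lexicon model segments a word by minimizing the total cost (negative log probability) of its morphs. This sketch shows only the Viterbi decoding step under an already-learned morph cost table; the function name, lexicon, and costs below are invented for the example and are not from the dissertation.

```python
import math

def segment_word(word, morph_cost, max_len=None):
    """Viterbi segmentation of a word into morphs under a unigram cost model
    (cost ~ negative log probability), i.e. the decoding step of a
    Morfessor-style segmental lexicon model.  morph_cost maps candidate
    morphs to costs; substrings outside the lexicon are treated as impossible."""
    n = len(word)
    max_len = max_len or n
    best = [math.inf] * (n + 1)   # best[j]: cheapest segmentation of word[:j]
    back = [0] * (n + 1)          # back[j]: start index of the last morph
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            piece = word[i:j]
            if piece in morph_cost and best[i] + morph_cost[piece] < best[j]:
                best[j] = best[i] + morph_cost[piece]
                back[j] = i
    if math.isinf(best[n]):
        return None  # no segmentation using in-lexicon morphs
    morphs, j = [], n
    while j > 0:
        morphs.append(word[back[j]:j])
        j = back[j]
    return list(reversed(morphs))
```

The same dynamic program applies whether the morphs are defined over graphemes or, as with the phonetically-based morphs proposed here, over phonetic representations; only the alphabet of the strings and the learned cost table change.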
Keywords/Search Tags: Recognition, Models, Acoustic, ASR, Morphs, Speech, Lexical, OOV