
PROSODY AND SPEECH RECOGNITION (ARTIFICIAL INTELLIGENCE)

Posted on: 1987-04-02
Degree: Ph.D.
Type: Thesis
University: Carnegie Mellon University
Candidate: WAIBEL, ALEXANDER
Full Text: PDF
GTID: 2478390017959029
Subject: Computer Science
Abstract/Summary:
Although numerous studies have demonstrated that prosody is critical to human speech perception, many automatic speech recognition systems process only spectral/phonetic cues. They ignore or deliberately remove prosodic cues such as pitch, intensity, rhythm, temporal relationships, and stress. Extending speech recognition systems to human performance levels, however, will require exploiting all available cues and sources of knowledge.

Major contributions of this research include several implemented knowledge sources and insights for further application of prosodic information to speech understanding. For lexical access, temporal knowledge sources restrict the word candidate set by 50% to 93%. Intensity- and stress-based knowledge sources also each reduce possible word candidate sets by about 50%. The lexical, prosodic knowledge sources were combined and compared with a phonetic word hypothesizer currently under development. The results show that the average rank of the correct word hypothesis can be reduced to almost 1/3 when prosodic knowledge sources are added to the phonetic word hypothesizer. At the sentential level of processing, a pitch contour knowledge source reduces syntactic and pragmatic ambiguity by discriminating between statement and question "tunes". We have examined the role of stress at distinct levels of speech processing. At the acoustic/phonetic level, we have reevaluated phonemic and phonetic consistency in stressed and unstressed syllables in terms of phonetic substitution and omission rates. Our results indicate that speaking rate, more than stress, predicts the rate of segmental omissions. At the syntactic level, automatically detected stress levels provide an acoustic cue to distinguishing between content words and function words in a spoken sentence.

This work demonstrates the power of prosodic constraints in computer speech recognition systems.
We first show theoretically that prosodic patterns can discriminate between words of large vocabularies (vocabularies the size an adult typically commands). We then introduce several novel algorithms to extract prosodic parameters reliably. These include segmentation algorithms for detecting syllable boundaries and major segment boundaries, and algorithms for measuring pitch contours, intensity contours, and lexical stress levels. An extensive performance evaluation of these algorithms is presented. We then implement and evaluate prosodic knowledge sources that apply the extracted parameters at the appropriate processing levels, including the lexical, syntactic, and sentential levels. To permit large-vocabulary capability, the knowledge source designs emphasize minimizing lexical search, exploiting parallelism, and speaker-independent and/or template-independent operation.
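
The idea that a prosodic pattern can narrow the set of word candidates can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy lexicon, the binary stress coding ("1" = stressed syllable, "0" = unstressed), and the function names `group_by_pattern`, `candidates`, and `expected_reduction` are all assumptions made for illustration, and the sketch assumes the stress pattern is detected without error and that every word is equally likely to be spoken.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> lexical stress pattern,
# one character per syllable ("1" = stressed, "0" = unstressed).
# The binary coding is a simplification chosen for this sketch.
TOY_LEXICON = {
    "speech": "1",
    "word": "1",
    "pitch": "1",
    "stress": "1",
    "rhythm": "10",
    "contour": "10",
    "about": "01",
    "reduce": "01",
    "lexical": "100",
    "syllable": "100",
    "computer": "010",
    "recognition": "0010",
}


def group_by_pattern(lexicon):
    """Partition the lexicon into classes of words sharing a stress pattern."""
    classes = defaultdict(list)
    for word, pattern in lexicon.items():
        classes[pattern].append(word)
    return dict(classes)


def candidates(lexicon, observed_pattern):
    """Return the words that remain possible given an observed pattern."""
    return sorted(w for w, p in lexicon.items() if p == observed_pattern)


def expected_reduction(lexicon):
    """Expected fraction of the lexicon eliminated by the pattern.

    Assumes every word is equally likely to be spoken and the pattern
    is detected perfectly. A word in a class of size k leaves k
    candidates, so the expected number remaining is sum(k^2) / n.
    """
    n = len(lexicon)
    classes = group_by_pattern(lexicon)
    expected_remaining = sum(len(ws) ** 2 for ws in classes.values()) / n
    return 1.0 - expected_remaining / n


if __name__ == "__main__":
    print(candidates(TOY_LEXICON, "01"))
    print(f"expected reduction: {expected_reduction(TOY_LEXICON):.1%}")
```

The figure this toy lexicon produces says nothing about the reductions reported in the abstract; it only shows how such a reduction could be computed once words are grouped by a prosodic pattern.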
Keywords/Search Tags:Speech recognition, Knowledge sources, Lexical