Speech Endpoint Detection Based On Audio And Visual Features

Posted on: 2008-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Chen
GTID: 2208360215466534
Subject: Computer application technology
Abstract/Summary:
As speech recognition becomes more widely used in daily life, it attracts growing attention. Over the past 50 years speech recognition has made great progress and now reaches very high accuracy. It has been applied in many fields, such as education, commerce, telephone voice dialing, audio applications of all kinds, and sound toys, and it is gradually spreading to others.

A basic task in speech recognition is speech endpoint detection, which uses digital signal processing to distinguish speech from non-speech signals in complex noise environments. It is one of the most important techniques in speech recognition, and its performance directly affects recognition accuracy. Traditional speech recognition attends to the audio signal alone. Its biggest problem is poor robustness: the accuracy of such systems drops rapidly in strong noise, which severely restricts their application. Pronunciation, however, produces not only a sound signal but also movement of the vocal organs, especially the lips. The audio and video signals of speech are therefore inherently connected, and the audio and visual features are both complementary and redundant. Exploiting this helps raise accuracy in noisy environments, and it is the main contribution of this thesis.

Chapter 1 is the introduction. It presents the concept of speech endpoint detection, the disadvantages of traditional audio-only endpoint detection, and the significance of using visual features in speech endpoint detection.

Chapter 2 describes traditional audio-based endpoint detection techniques and gives a speech endpoint detection algorithm based on time-frequency and frequency-domain methods. The algorithm detects four states (no-voice, transitional, voice, and ending) and describes the transitions among them.

Chapter 3 explains why visual features are introduced into speech endpoint detection and outlines face detection. It gives detailed descriptions of several algorithms, including feature group analysis, active shape models (ASM), and the linear subspace method, and it takes into account two factors that affect visual feature extraction: illumination and head movement. It also presents an algorithm that detects the face in a video frame and extracts the lip region from it. Finally, it presents and describes in detail an algorithm, named the division and union method, that builds on the feature-based method and the linear subspace method.

Chapter 4 gives three methods for detecting endpoints from visual features, each described in detail: the picture comparison method, which compares two frames and measures the degree of difference between them; the FAP (facial animation parameter) method, which finds the FAPs and derives the state of the lips; and the lip movement function method, which defines a function describing the movement of the lips.

Chapter 5 discusses a number of audio-visual fusion schemes at the state level. When the noise is low, the audio feature serves as the main cue; when the noise is high, the visual feature serves as the main cue; and when the two are comparable, the two methods are combined to detect the endpoint. Experiments show that the resulting accuracy is higher than that of the audio-only or video-only method.

Chapter 6 summarizes the thesis, presents some problems in digital speech endpoint detection that remain to be resolved, and points out possible directions for future research.
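The state-level fusion rule summarized for chapter 5 can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract gives no thresholds, noise measure, or combination rule, so the SNR thresholds, the use of SNR as the noise measure, the logical-OR combination, and all names below (`choose_source`, `fuse_frame`, `find_endpoints`) are assumptions made only for illustration.

```python
from enum import Enum


class Source(Enum):
    AUDIO = "audio"
    VISUAL = "visual"
    BOTH = "both"


# Hypothetical thresholds in dB; the abstract states no numeric values.
LOW_NOISE_SNR_DB = 20.0   # at or above this, trust the audio detector
HIGH_NOISE_SNR_DB = 5.0   # at or below this, trust the visual detector


def choose_source(snr_db: float) -> Source:
    """Pick the main cue from an estimated signal-to-noise ratio."""
    if snr_db >= LOW_NOISE_SNR_DB:
        return Source.AUDIO
    if snr_db <= HIGH_NOISE_SNR_DB:
        return Source.VISUAL
    return Source.BOTH


def fuse_frame(audio_is_speech: bool, visual_is_speech: bool, snr_db: float) -> bool:
    """Fuse two per-frame speech/non-speech decisions into one."""
    source = choose_source(snr_db)
    if source is Source.AUDIO:
        return audio_is_speech
    if source is Source.VISUAL:
        return visual_is_speech
    # Assumption: in the intermediate case, either cue may flag speech.
    return audio_is_speech or visual_is_speech


def find_endpoints(decisions):
    """Return (start, end) frame index pairs, end exclusive, for speech runs."""
    segments = []
    start = None
    for i, is_speech in enumerate(decisions):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(decisions)))
    return segments
```

Given per-frame decisions from an audio detector and a lip-movement detector, `fuse_frame` would be applied frame by frame and `find_endpoints` would turn the fused sequence into start and end frames. A weighted combination of the two cues would be an equally plausible reading of the abstract.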
Keywords/Search Tags: speech recognition, audio-visual speech recognition, endpoint detection, video feature, visual feature, face recognition, audio-visual fusion, facial animation parameter