A Chinese Multimodal Corpus Using Depth Information

Posted on: 2019-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Wang
Full Text: PDF
GTID: 2428330626952399
Subject: Computer Science and Technology
Abstract/Summary:
Speech recognition is one of the earliest modes of human-computer interaction and has made great progress over the past decades. Within this field, audio-visual speech recognition has significantly improved recognition performance when the audio is corrupted by noise. Research on audio-visual speech recognition requires a standard audio-visual corpus as its data foundation. However, research on audio-visual corpora in China remains insufficient. Many existing Chinese audio-visual corpora have a limited vocabulary and problems with audio and video quality, and the quality of their planar (2D) images is highly susceptible to factors such as illumination, speaker head rotation, and occlusion. In this thesis, depth information is integrated into a Chinese audio-visual corpus. A synchronous multimodal data acquisition system is developed using Microsoft's second-generation Kinect sensor, and a small pilot corpus is collected in advance. A multimodal speech recognition experiment conducted on this pilot corpus showed that depth information is of great help to speech recognition. The thesis then designs an automatic algorithm for selecting the corpus text. In the end, 146 sentences were selected as the final prompt material, covering 78% of toneless syllables and 93.3% of inter-syllabic bi-phones. Multimodal data from 69 speakers, comprising audio, color video, depth images, and 3D information, were recorded in a professional recording room; the resulting Chinese multimodal corpus has a total duration of 22.4 hours and occupies more than 6 TB of disk space. Finally, the thesis designs isolated-word recognition and continuous speech recognition benchmark experiments on the multimodal corpus and analyzes the contribution and value of depth data to speech recognition research.
Keywords/Search Tags:Speech Recognition, Kinect, Depth Image, Corpus, Multimodal, Phoneme Balance
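
The abstract does not describe how the automatic sentence selection works. As a rough illustration only, one common way to build a phonetically balanced prompt list is a greedy set-cover heuristic: repeatedly pick the sentence that adds the most units (for example syllables or bi-phones) not yet covered. The sketch below assumes each candidate sentence has already been converted to a set of unit labels; the function names `greedy_select` and `coverage`, the toy sentences, and the unit labels are all hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[str]], max_sentences: int) -> List[str]:
    """Greedy set cover: at each step pick the sentence adding the most new units."""
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_sentences:
        # Iterate ids in sorted order so ties are broken deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sentence contributes a new unit
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen


def coverage(chosen: List[str], candidates: Dict[str, Set[str]], inventory: Set[str]) -> float:
    """Fraction of the target unit inventory covered by the chosen sentences."""
    covered: Set[str] = set()
    for sentence_id in chosen:
        covered |= candidates[sentence_id]
    return len(covered & inventory) / len(inventory)


# Toy data (hypothetical): sentence id -> set of syllable and bi-phone labels.
toy_candidates = {
    "s1": {"ni", "hao", "ni-hao"},
    "s2": {"hao", "ma", "hao-ma"},
    "s3": {"ni", "hao", "ma", "ni-hao", "hao-ma"},
}
toy_inventory = {"ni", "hao", "ma", "ni-hao", "hao-ma", "ma-ni"}

toy_selection = greedy_select(toy_candidates, max_sentences=2)
toy_coverage = coverage(toy_selection, toy_candidates, toy_inventory)
```

On the toy data, a single sentence already covers everything the other two offer, so the loop stops early instead of filling the quota. A real selection procedure would likely also weight rare units or cap sentence length, which this sketch does not attempt.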