
Research On Two Typical Speech Processing Applications Based On Deep Learning

Posted on: 2017-11-14
Degree: Master
Type: Thesis
Country: China
Candidate: W J Feng
Full Text: PDF
GTID: 2428330569498753
Subject: Computer Science and Technology
Abstract/Summary:
Deep learning is one of the most advanced research fields in artificial intelligence, and it has made astonishing progress in computer vision, speech processing, robot control, and bioinformatics. Deep learning analyzes and learns in a way that simulates the human brain, building complex concepts by abstracting and combining simple ones. Compared with conventional machine learning algorithms, deep learning does not rely on hand-crafted features. In this thesis, we study two typical deep learning based application problems in speech processing, namely audio matching and audio-visual speech recognition. From an engineering viewpoint, audio matching and speech recognition are key technologies of speech processing and have been widely used in speech retrieval and intelligence analysis. From a theoretical viewpoint, audio matching and speech recognition are, respectively, typical unsupervised and supervised problems in speech processing, so research on deep learning models for these two kinds of problems is of great academic value. The major contributions are as follows:

First, to improve the generalization capability of traditional audio matching methods, this thesis proposes extracting audio features with Convolutional Deep Belief Networks (CDBNs). CDBNs combine the advantages of Convolutional Neural Networks (CNNs), which handle high-dimensional data, with those of Deep Belief Networks (DBNs), which learn without supervision, and can therefore extract features with strong generalization capability from high-dimensional audio data in an unsupervised way. Based on the binary features extracted by the CDBN, we propose a faster audio feature matching algorithm. Experimental results show that the CDBN-based audio matching algorithm significantly improves the hit rate of audio matching compared with the traditional matching algorithm based on chroma energy normalized statistics features.

Second, to exploit the temporal characteristics of both audio and video information, this thesis proposes a multimodal Recurrent Neural Network (RNN) framework for multimodal speech recognition. The framework consists of an auditory part that processes audio data, a visual part that processes video data, and a fusion part that combines the two. The experimental results demonstrate that the proposed speech recognition system based on the multimodal RNN successfully combines video and audio features and effectively improves recognition accuracy over an audio-only system, especially on the low-SNR dataset.
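To make the first contribution concrete, the following is a minimal sketch of fast matching over binarized audio features using bit-packed Hamming distance. The abstract does not specify the exact matching procedure, so the thresholding step, the sliding-window scan, and all parameter values below are illustrative assumptions rather than the thesis's algorithm.

```python
# Minimal sketch: match binarized audio features by Hamming distance.
# Assumes real-valued feature frames (e.g., CDBN activations) are already
# available; the binarization threshold and the exhaustive window scan are
# illustrative choices, not the procedure used in the thesis.
import numpy as np

def binarize(features, threshold=0.5):
    """Turn real-valued frames (T x D) into bit-packed rows (T x ceil(D/8))."""
    bits = (features > threshold).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_match(query_bits, reference_bits):
    """Slide the query over the reference; return the best offset and distance."""
    q_len = query_bits.shape[0]
    best_offset, best_dist = -1, np.inf
    for offset in range(reference_bits.shape[0] - q_len + 1):
        window = reference_bits[offset:offset + q_len]
        # XOR then popcount gives the Hamming distance between bit rows.
        dist = int(np.unpackbits(np.bitwise_xor(window, query_bits)).sum())
        if dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, best_dist
```

Packing the bits lets each comparison operate on whole bytes, which is the main reason binary features admit faster matching than dense real-valued ones.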
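The second contribution describes an auditory part, a visual part, and a fusion part. The sketch below is one plausible realization in PyTorch, assuming per-frame audio and video feature vectors and late fusion by concatenating the final recurrent states; the layer types, widths, and fusion strategy are assumptions and are not taken from the thesis.

```python
# Minimal sketch of a multimodal RNN with separate audio and video branches
# and a fusion classifier. Input dimensions, hidden sizes, and the use of GRUs
# with concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalRNN(nn.Module):
    def __init__(self, audio_dim=40, video_dim=64, hidden=128, num_classes=10):
        super().__init__()
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)  # auditory part
        self.video_rnn = nn.GRU(video_dim, hidden, batch_first=True)  # visual part
        self.fusion = nn.Sequential(                                  # fusion part
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, audio, video):
        # audio: (batch, T, audio_dim); video: (batch, T, video_dim)
        _, h_a = self.audio_rnn(audio)
        _, h_v = self.video_rnn(video)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)  # concatenate final states
        return self.fusion(fused)                      # class logits
```

Keeping the two branches separate until the fusion layer lets the classifier weigh the modalities against each other, which is consistent with the abstract's observation that the gain over audio-only recognition is largest at low SNR.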
Keywords/Search Tags: Deep learning, speech processing, audio matching, audio-visual speech recognition