
Research On Two Typical Speech Processing Applications Based On Deep Learning

Posted on: 2017-11-14
Degree: Master
Type: Thesis
Country: China
Candidate: W J Feng
Full Text: PDF
GTID: 2428330569498753
Subject: Computer Science and Technology
Abstract/Summary:
Deep learning is one of the most advanced research fields in artificial intelligence, and it has made astonishing progress in computer vision, speech processing, robot control, and bioinformatics. Deep learning analyzes and learns in a way that simulates the human brain, building complex concepts by abstracting and combining simple ones. Compared with conventional machine learning algorithms, deep learning does not rely on hand-crafted features. In this thesis, we study two typical deep learning based application problems in speech processing, namely audio matching and audio-visual speech recognition. From an engineering viewpoint, audio matching and speech recognition are key technologies of speech processing and have been widely used in speech retrieval and intelligence analysis. From a theoretical viewpoint, audio matching and speech recognition are, respectively, typical unsupervised and supervised problems in speech processing, so research on deep learning models for these two kinds of problems is of great academic value. The major contributions are as follows:

First, to improve the generalization capability of traditional audio matching methods, this thesis proposes extracting audio features with Convolutional Deep Belief Networks (CDBNs). CDBNs combine the advantages of Convolutional Neural Networks (CNNs), which handle high-dimensional data, with those of Deep Belief Networks (DBNs), which learn without supervision, and can therefore extract features with strong generalization capability from high-dimensional audio data in an unsupervised way. Based on the binary features extracted by the CDBN, we propose a faster audio feature matching algorithm. Experimental results show that the CDBN-based audio matching algorithm significantly improves the hit rate of audio matching compared with the traditional matching algorithm based on chroma energy normalized statistics features.

Second, to exploit the temporal characteristics of both audio and video information, this thesis proposes a multimodal Recurrent Neural Network (RNN) framework for multimodal speech recognition. The framework consists of an auditory part that processes audio data, a visual part that processes video data, and a fusion part that combines the two. The experimental results demonstrate that the proposed speech recognition system based on the multimodal RNN successfully combines video and audio features and effectively improves recognition accuracy over an audio-only system, especially on the low-SNR dataset.
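To make the first contribution concrete, the following is a minimal sketch of fast matching over binarized audio features using bit-packed Hamming distance. The abstract does not specify the exact matching procedure, so the thresholding step, the sliding-window scan, and all parameter values below are illustrative assumptions rather than the thesis's algorithm.

```python
# Minimal sketch: match binarized audio features by Hamming distance.
# Assumes real-valued feature frames (e.g., CDBN activations) are already
# available; the binarization threshold and the exhaustive window scan are
# illustrative choices, not the procedure used in the thesis.
import numpy as np

def binarize(features, threshold=0.5):
    """Turn real-valued frames (T x D) into bit-packed rows (T x ceil(D/8))."""
    bits = (features > threshold).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_match(query_bits, reference_bits):
    """Slide the query over the reference; return the best offset and distance."""
    q_len = query_bits.shape[0]
    best_offset, best_dist = -1, np.inf
    for offset in range(reference_bits.shape[0] - q_len + 1):
        window = reference_bits[offset:offset + q_len]
        # XOR then popcount gives the Hamming distance between bit rows.
        dist = int(np.unpackbits(np.bitwise_xor(window, query_bits)).sum())
        if dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, best_dist
```

Packing the bits lets each comparison operate on whole bytes, which is the main reason binary features admit faster matching than dense real-valued ones.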
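The second contribution describes an auditory part, a visual part, and a fusion part. The sketch below is one plausible realization in PyTorch, assuming per-frame audio and video feature vectors and late fusion by concatenating the final recurrent states; the layer types, widths, and fusion strategy are assumptions and are not taken from the thesis.

```python
# Minimal sketch of a multimodal RNN with separate audio and video branches
# and a fusion classifier. Input dimensions, hidden sizes, and the use of GRUs
# with concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalRNN(nn.Module):
    def __init__(self, audio_dim=40, video_dim=64, hidden=128, num_classes=10):
        super().__init__()
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)  # auditory part
        self.video_rnn = nn.GRU(video_dim, hidden, batch_first=True)  # visual part
        self.fusion = nn.Sequential(                                  # fusion part
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, audio, video):
        # audio: (batch, T, audio_dim); video: (batch, T, video_dim)
        _, h_a = self.audio_rnn(audio)
        _, h_v = self.video_rnn(video)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)  # concatenate final states
        return self.fusion(fused)                      # class logits
```

Keeping the two branches separate until the fusion layer lets the classifier weigh the modalities against each other, which is consistent with the abstract's observation that the gain over audio-only recognition is largest at low SNR.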
Keywords/Search Tags: Deep learning, speech processing, audio matching, audio-visual speech recognition