
Disfluent Event Detection In Chinese Spoken Speech Based On Multiple Instance Learning

Posted on: 2020-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: X D Wu
GTID: 2415330590473231
Subject: Computer technology
Abstract/Summary:
With the development of the Internet, artificial intelligence is increasingly integrated into people's work and life, and intelligent voice interaction between people and machines is more widely used. Machines need to better understand people's speech in a variety of situations. At present, speech recognition mainly handles fluent read speech and short imperative speech; for long, natural spoken speech it still faces great challenges. This thesis focuses on the detection of disfluent events in Chinese natural spoken speech data, which can be regarded as front-end work for speech recognition tasks. Disfluent events are hesitations, filled pauses, repetitions and similar phenomena in natural spoken language, that is, stretches of speech with abnormal semantics. Research on the detection of disfluent events belongs to the field of paralinguistic speech research. In this thesis, prosodic features and spectral features that characterize disfluent speech well are extracted, and a multiple instance learning model is proposed to address the short duration of disfluent events and the large amount of noise they contain. Disfluent events are identified by training multiple instance learning models. The main contents of the thesis are:

1) Construction of a corpus of Chinese spoken disfluency events. The fluent and disfluent speech needed is extracted from the existing Harbin Institute of Technology streaming media data corpus. Based on the annotation files in the original corpus, the thesis analyzes the annotation features of disfluent events, finds the labeling rules for disfluent events, and from these rules derives an effective method for automatically detecting disfluent events and automatically cutting out the disfluent speech. After re-examination of the automatically cut data, the construction of a corpus of Chinese natural spoken disfluency events is completed.

2) A disfluent speech classification method based on the Long Short-Term Memory (LSTM) network. The LSTM network model is used as the baseline system to identify and detect disfluent events in Chinese natural spoken language. The feature used by the baseline system is the Mel Frequency Cepstral Coefficient (MFCC) speech feature. After an introduction to the structure of the LSTM network, the audio preprocessing, feature extraction, and LSTM model training and testing for this model are described. Finally, the test results of the recognition model are given.

3) Recognition of disfluent events based on a Multiple Instance Learning (MIL) SVM model. In multiple instance learning, each bag is labeled while the individual instances in the bag are not. Multiple instance learning can therefore be described as a learning method that combines characteristics of supervised and unsupervised learning. In this thesis, multiple instance learning is introduced into the spoken disfluent event recognition task and combined with a traditional machine learning method, the Support Vector Machine (SVM) classifier, for classification and detection. The system uses the IS10 Paralinguistic Challenge feature set, which contains not only spectral features such as MFCC but also prosodic features that characterize the acoustic properties of speech well, and which has achieved good results in previous paralinguistic speech recognition work.

4) Recognition of disfluent events based on a multiple instance learning neural network model. Multiple instance learning is introduced into the neural network by constructing an error function that conforms to the multiple instance learning rules. The neural network has strong learning ability and high robustness. A neural network model based on multiple instance learning is trained and then used to recognize the speech in the test set, giving the recognition accuracy. The model is then improved by adding a deep supervision mechanism to strengthen its feature learning ability and raise the recognition accuracy. Finally, experiments show that the improved method achieves better recognition results.
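
Item 4) mentions an error function that conforms to the multiple instance learning rules, where only the bag carries a label. The abstract does not give the thesis's actual function, so the following is only a minimal sketch of one common choice, assuming max pooling over instance scores and a binary cross-entropy loss at the bag level. The function names and the per-frame probabilities are hypothetical and used purely for illustration.

```python
import math


def bag_probability(instance_probs):
    """Pool instance probabilities into one bag probability.

    Assumption (not from the thesis): max pooling, i.e. a bag counts as
    disfluent if at least one of its instances looks disfluent.
    """
    return max(instance_probs)


def bag_loss(instance_probs, bag_label, eps=1e-12):
    """Binary cross-entropy between the pooled bag probability and the bag label.

    Only the bag label (1 = disfluent, 0 = fluent) is needed; no instance
    inside the bag carries a label of its own.
    """
    p = min(max(bag_probability(instance_probs), eps), 1.0 - eps)
    return -(bag_label * math.log(p) + (1 - bag_label) * math.log(1.0 - p))


# Hypothetical per-frame disfluency probabilities from some instance-level model.
disfluent_bag = [0.05, 0.10, 0.92, 0.08]  # one short disfluent stretch among fluent frames
fluent_bag = [0.04, 0.07, 0.03, 0.06]     # no frame looks disfluent

print(bag_loss(disfluent_bag, 1))  # small loss: the bag label agrees with the pooled score
print(bag_loss(disfluent_bag, 0))  # larger loss: the bag label disagrees with the pooled score
print(bag_loss(fluent_bag, 0))
```

With max pooling, a single high-scoring instance is enough to mark the whole bag as disfluent, which fits events that are short relative to the surrounding speech. Other pooling rules (for example mean or noisy-or) would be equally valid instances of the same idea.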
Keywords/Search Tags:disfluent spoken speech recognition, LSTM, SVM, multi-instance learning