Speech Evaluation Based On Joint Learning Of Speech And Text

Posted on: 2021-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: W L Zhang
GTID: 2428330611998854
Subject: Computer Science and Technology
Abstract/Summary:
In real life, many scenarios require evaluating a speaker's spoken expression ability, such as Mandarin proficiency tests, oral training, language teaching assessment, and tests for broadcast hosts. At present, most of these scenarios still rely on manual scoring, which often lacks fairness and is time-consuming, costly, and inefficient overall. Language learners also need automatic speech evaluation tools that can provide feedback at any time. Existing automatic speech evaluation systems usually consider only speech-level information and ignore text-related content such as semantics and grammar, so they cannot reflect everything the speaker expresses. In addition, in many scenarios the rater gives the speaker only a single overall score and cannot evaluate along multiple dimensions.

For the data found in general speech evaluation scenarios, this thesis designs a standard data preprocessing and feature extraction pipeline. The pipeline consists of three parts: (1) voice activity detection, which removes noise from the speech data and improves its quality; (2) speech recognition, which transcribes the speech into text and lays the foundation for the subsequent multimodal modeling; and (3) data resampling, which addresses the uneven distribution of data labels. The effectiveness of the pipeline is verified with controlled-variable experiments. The results show that all three preprocessing steps significantly improve the performance of the automatic speech evaluation model.

This thesis then proposes a multimodal automatic speech evaluation method based on joint learning of speech and text. The method uses two recurrent sequence models, the gated recurrent unit (GRU) network and the long short-term memory (LSTM) network, as the basic framework of the model, and designs in detail a multimodal input structure and an attention-based multimodal fusion structure. The experimental results show that, in automatic speech evaluation scenarios, the multimodal model based on joint learning of speech and text outperforms the speech-only model on the text-dependent scoring modules; the model using the GRU network outperforms the model using the LSTM network; and the deep learning models significantly outperform the traditional machine learning model.

For multi-parameter speech evaluation scenarios, this thesis further proposes an automatic speech evaluation method based on a multi-task learning mechanism, which improves performance beyond the multimodal method. In this method, the network structure is shared among multiple scoring tasks, and the model is tuned by adjusting the weights of the different tasks. Scoring modules that are highly correlated are combined for multi-task learning. The experimental results show that multi-task learning effectively improves the performance of the fluency, emotional performance, and rhythm modules. The method can substantially assist multi-parameter speech evaluation.
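
The abstract does not give the fusion or loss formulas, so the following is only a minimal sketch of the two ideas it names: attention-weighted fusion of a speech vector and a text vector, and a multi-task loss tuned by per-task weights. The function names, the `query` vector (standing in for a learned parameter), and the choice of squared error are all assumptions made for illustration, not the thesis's actual design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(speech_vec, text_vec, query):
    """Fuse two equal-length modality vectors with dot-product attention.

    `query` is a hypothetical stand-in for a learned attention vector.
    Returns the fused vector and the attention weight of each modality.
    """
    modalities = [speech_vec, text_vec]
    # One relevance score per modality: dot product with the query.
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in modalities]
    weights = softmax(scores)
    # Weighted sum of the modality vectors, dimension by dimension.
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, modalities))
        for i in range(len(query))
    ]
    return fused, weights


def weighted_multitask_loss(predictions, targets, task_weights):
    """Weighted sum of per-task squared errors (loss type is an assumption)."""
    return sum(
        task_weights[task] * (predictions[task] - targets[task]) ** 2
        for task in task_weights
    )


# Hypothetical usage with toy numbers.
fused, weights = attention_fuse([1.0, 0.0], [0.0, 1.0], query=[1.0, 1.0])
loss = weighted_multitask_loss(
    predictions={"fluency": 3.0, "rhythm": 4.0},
    targets={"fluency": 4.0, "rhythm": 4.0},
    task_weights={"fluency": 0.5, "rhythm": 0.5},
)
```

In a trained model the attention scores and the shared GRU or LSTM encoder would be learned jointly. Here the task weights are fixed by hand, which mirrors the abstract's description of tuning the model by adjusting the weights between tasks.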
Keywords/Search Tags: automatic speech evaluation, multimodal fusion, deep learning, gated recurrent unit, multi-task learning