Text-Speech Alignment, built on automatic speech recognition techniques, is the process of aligning speech with its text in time. In recent years, the rapid development of the internet has made ever more speech and text data available, and aligning these speech and text data in time is the key to exploiting them, so Text-Speech Alignment has attracted growing research interest.

Text-Speech Alignment is a key technology in the field of speech recognition. The conventional method uses a speech recognizer to transcribe the speech, obtaining a recognition result that includes time information; this result is aligned with the original text to find their common part, which in turn identifies the corresponding speech segments. The aligned data are used to train acoustic models, evaluate speech, build corpora automatically, and support multimedia information retrieval, among other applications. To improve the accuracy and robustness of this approach, however, the recognizer must be trained on a large amount of labeled data, whose collection consumes enormous labor, material and financial resources, and time.

This paper reviews the research status at home and abroad and proposes a Text-Speech Alignment algorithm that does not depend on a speech recognizer trained with large amounts of labeled data. Using this algorithm, aligned data can be obtained automatically and then used to train a continuous speech recognition system based on context-dependent triphones, demonstrating the application.

The contributions of this paper mainly include the following aspects.

First, to remove the dependence on labeled data, we propose a Text-Speech Alignment method based on an open speech recognizer (Google Voice Recognition, GVR) and a language model constructed as a finite state automaton. Using this algorithm, aligned speech-text data can be obtained automatically.
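The "common part" extraction used by the conventional method can be sketched with a longest-matching-subsequence comparison between the original transcript and the recognizer output. The sketch below is illustrative only (the function name and the tuple layout of the recognizer output are assumptions, not part of the proposed system); it keeps exactly those recognized words, with their timestamps, that match the original text in order.

```python
from difflib import SequenceMatcher

def align_recognition(original_words, recognized):
    """Find the common part between the original transcript and the
    recognizer output, keeping the recognizer's time information.

    original_words: list of words from the original text.
    recognized: list of (word, start_sec, end_sec) from the recognizer.
    Returns a list of (word, start_sec, end_sec) anchor words that
    match the original transcript in order.
    """
    rec_words = [w for w, _, _ in recognized]
    matcher = SequenceMatcher(a=original_words, b=rec_words, autojunk=False)
    anchors = []
    for block in matcher.get_matching_blocks():
        # Each matching block is a run of identical words in both sequences.
        for k in range(block.size):
            anchors.append(recognized[block.b + k])
    return anchors
```

For example, if the recognizer mishears "sat" as "fat", only the five correctly recognized words survive as time-stamped anchors, and the speech between consecutive anchors can then be attributed to the skipped text.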
In particular, the speech data are first submitted to GVR to obtain a recognition text; this text, however, carries no time information, which is the key element of Text-Speech Alignment. To recover the time information, the speech is recognized a second time, using an acoustic model trained on the original speech and text data and a language model based on the finite state automaton. This second recognition pass yields the time information and completes the Text-Speech Alignment.

Next, the data obtained in the previous step are used to train an acoustic model, on top of which the SailAlign algorithm is adopted and improved to align speech and text data effectively and complete the corpus construction. It has been demonstrated that the alignment accuracy reaches 95% when the text noise is 10% or less.

Finally, a continuous speech recognition system based on context-dependent triphones is constructed to test the performance of the proposed Text-Speech Alignment algorithm. In the feature extraction step, pitch is added to the feature vector; because pitch discriminates well between voiced and unvoiced sounds, the recognition accuracy is higher than that of a recognizer based on Mel cepstral parameters alone.
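The finite-state-automaton language model described above constrains the second recognition pass to the known text. A minimal sketch of the idea is a linear-chain automaton built from the transcript, which accepts exactly the transcript's word sequence; the function names here are illustrative assumptions, and a practical alignment grammar would additionally allow optional silences and word skips.

```python
def build_transcript_fsa(words):
    """Build a linear finite state automaton from a transcript:
    state i --words[i]--> state i+1, with final state len(words).
    Returns (transition table, final state)."""
    transitions = {(i, w): i + 1 for i, w in enumerate(words)}
    return transitions, len(words)

def accepts(transitions, final_state, sequence):
    """Run the automaton over a word sequence; True iff it ends in
    the final state, i.e. the sequence is exactly the transcript."""
    state = 0
    for w in sequence:
        nxt = transitions.get((state, w))
        if nxt is None:
            return False  # word not allowed in this state
        state = nxt
    return state == final_state
```

During decoding, such an automaton replaces an n-gram language model: at each state only one word hypothesis is permitted, so the recognizer's job reduces to placing the known words in time, which is exactly the time information the alignment needs.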