
Research On Text-Audio Alignment

Posted on: 2010-09-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Tao
Full Text: PDF
GTID: 1118360302483790
Subject: Computer software and theory
Abstract/Summary:
Speech is one of the most important forms of human communication, and text is a system of signs used to record it. With the growing volume of multimedia content and the rapid development of networks and Automatic Speech Recognition (ASR) technology, aligning audio files with their corresponding textual transcripts has become an important research direction in web indexing, computer-assisted language learning, and related fields. Built on the key techniques of ASR, text-audio alignment aligns transcribed data with speech data, identifying which time segments in the speech correspond to particular words or phones in the transcription. It has been widely applied in model training, speech evaluation, media information retrieval, and broadcast/TV publishing.

Recent research focuses on effectiveness and robustness, making alignment tolerant of acoustic noise and of errors or gaps in the transcript or the audio track. This dissertation proposes a systematic, integrated approach to aligning very long, possibly noisy, and even highly imperfect speech signals with their associated transcripts, and discusses its application in a speech evaluation system. The main contributions are:

1. A fuzzy-logic-based approach to labeling speech segments in media files. A fuzzy inference system fuzzifies features extracted along different dimensions. Applying predefined rules yields an output representing the degree to which a clip belongs to speech, which is then used for speech/non-speech classification. The algorithm improves the accuracy of speech segment detection compared with single-feature voice activity detection (VAD) algorithms.
2. An extended alignment network that allows word- and phone-level insertion, substitution, and deletion errors, instead of forcing the recognizer to align the exact string present in the transcription. Experiments show improved alignment accuracy.
3. A dynamic alignment approach for long, imperfect speech and its corresponding transcription. The algorithm starts with multi-stage sentence boundary detection in the audio, followed by a dynamic-programming-based search that finds the optimal alignment and detects mismatches at the sentence level. Experiments show promising performance compared with traditional forced alignment.
4. A text-speech alignment engine designed and implemented on the basis of the proposed algorithms. The underlying technologies are presented in detail, and applications in multimedia content preparation and speech evaluation are discussed. A platform based on this engine has been deployed on the TALKPAL™ English training platform, which now serves users worldwide.
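The tolerance to insertions, substitutions, and deletions described in contributions 2 and 3 can be illustrated at the text level with a small dynamic-programming alignment. This is a minimal sketch and not the dissertation's method, which operates on an acoustic alignment network; the function name `align_words`, the unit edit costs, and the idea of comparing a transcript against a recognizer's word output are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# One aligned position: (transcript word or None, recognized word or None).
Pair = Tuple[Optional[str], Optional[str]]


def align_words(transcript: List[str], recognized: List[str]) -> List[Pair]:
    """Align two word sequences with unit-cost edits (hypothetical helper).

    A None on the recognized side marks a transcript word with no spoken
    counterpart; a None on the transcript side marks an extra spoken word.
    """
    n, m = len(transcript), len(recognized)

    # cost[i][j] = minimum number of edits aligning transcript[:i]
    # with recognized[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if transcript[i - 1] == recognized[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,  # match or substitution
                cost[i - 1][j] + 1,        # transcript word not spoken
                cost[i][j - 1] + 1,        # spoken word not in transcript
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if transcript[i - 1] == recognized[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                pairs.append((transcript[i - 1], recognized[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((transcript[i - 1], None))
            i -= 1
        else:
            pairs.append((None, recognized[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    # Made-up example data, not taken from the dissertation.
    ref = ["the", "cat", "sat", "on", "the", "mat"]
    hyp = ["the", "cat", "on", "a", "mat"]
    for ref_word, hyp_word in align_words(ref, hyp):
        print(ref_word, hyp_word)
```

Pairs whose two sides differ mark the mismatches. A real aligner would additionally attach time stamps to the recognized words and would weight the edit costs by acoustic scores rather than using unit costs.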
Keywords/Search Tags: text-audio alignment, speech evaluation, fuzzy logic, dynamic programming, voice activity detection