Research On Text-Audio Alignment

Posted on:2010-09-17

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y Tao

GTID:1118360302483790

Subject:Computer software and theory

Abstract/Summary:

Speech is one of the most important forms of human communication, and text is a system of signs used to record it. With the growing volume of multimedia content and the rapid development of networks and Automatic Speech Recognition (ASR) technology, aligning audio files with their corresponding textual transcripts has become an important research direction in web indexing, computer-assisted language learning, and related fields. Built on core ASR techniques, text-audio alignment matches the transcript to the speech data, identifying which time segments of the speech correspond to particular words or phones in the transcript. It has been widely applied in model training, speech evaluation, media information retrieval, and broadcasting and TV publication.

Recent research focuses on effectiveness and robustness, making alignment tolerant of acoustic noise and of errors or gaps in the transcript or the audio track. This dissertation proposes a systematic, integrated approach to aligning very long, possibly noisy, and even highly imperfect speech signals with their associated transcripts, and discusses its application in a speech evaluation system. The main contributions are:

1. A fuzzy-logic-based approach to labeling speech segments in media files. A fuzzy inference system fuzzifies features extracted from different dimensions. By applying predefined rules, it computes an output representing the degree to which a clip belongs to speech, which is then used for speech/non-speech classification. The algorithm improves the accuracy of speech segment detection compared with single-feature voice activity detection (VAD) algorithms.

2. An extended alignment network that allows word- and phone-level insertions, substitutions, and deletions, instead of forcing the recognizer to align the exact string present in the transcript. Experiments show improved alignment accuracy.

3. A dynamic alignment approach for long, imperfect speech and its corresponding transcript. The algorithm starts with multi-stage sentence boundary detection in the audio, followed by a dynamic-programming-based search that finds the optimal alignment and detects mismatches at the sentence level. Experiments show promising performance compared with traditional forced alignment.

4. A text-speech alignment engine designed and implemented on the basis of the proposed algorithms. The dissertation presents the technical details and discusses applications in multimedia content preparation and speech evaluation. A platform based on this engine has been deployed on the TALKPAL™ English training platform, which now serves users worldwide.
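The idea of tolerating insertions, substitutions, and deletions can be illustrated with a small text-level sketch. This is not the dissertation's extended alignment network, which operates inside the recognizer; it is a plain Levenshtein (edit-distance) alignment between a transcript and a hypothetical recognizer output, solved by dynamic programming. The function name `align_words` and the operation labels are assumptions made for this illustration.

```python
from typing import List, Tuple


def align_words(reference: List[str], hypothesis: List[str]) -> List[Tuple[str, str, str]]:
    """Align transcript words (reference) with recognized words (hypothesis).

    Returns (op, ref_word, hyp_word) tuples, where op is one of:
      "match" - same word in both sequences
      "sub"   - transcript word replaced by a different recognized word
      "del"   - transcript word with no counterpart in the recognized words
      "ins"   - recognized word with no counterpart in the transcript
    """
    n, m = len(reference), len(hypothesis)

    # cost[i][j] = minimum number of edits aligning reference[:i] with hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(reference[i - 1] != hypothesis[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                op = "sub" if mismatch else "match"
                ops.append((op, reference[i - 1], hypothesis[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", reference[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hypothesis[j - 1]))
            j -= 1
    ops.reverse()
    return ops


# Example: the recognizer output is missing the transcript word "the".
for op in align_words("please open the door".split(), "please open door".split()):
    print(op)
```

In this example the word "the" is reported as a deletion while the remaining words are matched, which is the kind of mismatch a strict forced alignment cannot represent. A real system would attach timestamps to the recognized words and work with acoustic scores rather than exact string equality.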

Keywords/Search Tags:

text-audio alignment, speech evaluation, fuzzy logic, dynamic programming, voice activity detection

Related items

1	Research On Automatic Speech-Text Alignment For Mongolian Long Audio
2	Research And Implementation Report-oriented Voice Activity Detection
3	Research And Application Of Speech And Text Automatic Alignment Technology Based On Text Similarity Algorithm
4	Research On Voice Activity Detection Based On ACAM And Traditional Classification Model
5	High Robust Low Power Voice Activity Detection Design Based On DNN
6	Research Of Efficient Speech Enhancement And Voice Activity Detection
7	Research On Voice Activity Detection Algorithm In Low SNR
8	Voice Activity Detection With Deep Learning
9	Multi-speaker Recognition Based On Audio Video Information Fusion In Meeting Room Environment
10	Speech Synthesis for Text-Based Editing of Audio Narration