Research On Chinese Spoken Term Detection Technology For News Corpus

Posted on:2013-11-17

Degree:Master

Type:Thesis

Country:China

Candidate:K W Wang

Full Text:PDF

GTID:2268330392967985

Subject:Computer Science and Technology

Abstract/Summary:

Spoken term detection (STD) returns the relevant segments of a speech corpus in response to users' text queries. STD is an important area of speech recognition and has broad application prospects. An STD system is usually designed in two stages: off-line indexing and online searching. The accuracy of the system therefore depends heavily on the quality of the index.

Indexing is usually based on the output of an automatic speech recognition (ASR) system. Most STD systems build their indices from the lattice produced by the recognizer. The lattice is well structured and carries a large amount of information. The probability of a local path through the lattice can be computed from the acoustic likelihood and the language model, and this information is kept in the lattice. Using this probability as the confidence measure during indexing is simple and effective. Because the traditional N-gram model (i.e., the bigram model) does not consider the syntactic and semantic constraints imposed by more distant words, it misses some information. The long-distance bigram model used in this thesis captures different aspects of the syntactic and semantic constraints between words. An STD system based on the lattice and the long-distance bigram model, rather than the traditional N-gram model, is therefore expected to improve the quality of the indices and the performance of the system. Our experiments compare STD systems built with bigrams of different distances and show that integrating the results of these systems gives higher detection recall than a system based on the traditional N-gram model.

A news corpus is an ideal choice for building the speech recognizer of an STD system for news databases. At the front end of the STD system, the input speech must be converted into text by a speech recognition system. However, the commercial news corpora currently available do not come with detailed transcripts: the transcripts are aligned at the paragraph level, not the phrase level, so they cannot be used directly for the recognition task. This thesis presents an automatic method, based on speech recognition, for segmenting paragraph-level speech. The method first constructs a linear recognition network, then inserts silence models between short speech utterances, and finally decodes the speech. The experiments show that the method performs well when segmenting paragraph-level recordings shorter than 11 minutes. We conclude that it is an effective method for segmenting paragraph-level speech.
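The long-distance bigram idea can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the unsmoothed maximum-likelihood estimate, and the rule for integrating several systems (take the union of detections and keep the highest confidence per hit) are all assumptions made for illustration, since the abstract does not specify them. A distance-d bigram conditions each word on the word d positions earlier, so d = 1 is the ordinary bigram.

```python
from collections import Counter


def distance_bigram_counts(sentences, d):
    """Count (history, word) pairs where history is d positions before word."""
    pair_counts = Counter()
    history_counts = Counter()
    for words in sentences:
        for i in range(d, len(words)):
            pair_counts[(words[i - d], words[i])] += 1
            history_counts[words[i - d]] += 1
    return pair_counts, history_counts


def distance_bigram_prob(pair_counts, history_counts, history, word):
    """Unsmoothed estimate of P_d(word | history); 0.0 for an unseen history."""
    total = history_counts[history]
    return pair_counts[(history, word)] / total if total else 0.0


def merge_detections(*hit_lists):
    """Assumed integration rule: union of hits, keeping the best score per hit.

    Each hit list holds (key, score) pairs, where key identifies a detection,
    for example (utterance_id, start_frame).
    """
    merged = {}
    for hits in hit_lists:
        for key, score in hits:
            merged[key] = max(score, merged.get(key, 0.0))
    return merged


# Toy corpus: the distance-2 model links "a" to "c" across the middle word.
corpus = [["a", "b", "c"], ["a", "d", "c"]]
pairs_d1, hist_d1 = distance_bigram_counts(corpus, 1)
pairs_d2, hist_d2 = distance_bigram_counts(corpus, 2)
p_d1 = distance_bigram_prob(pairs_d1, hist_d1, "a", "b")
p_d2 = distance_bigram_prob(pairs_d2, hist_d2, "a", "c")

# Hypothetical hit lists from two systems built with different distances.
merged = merge_detections(
    [(("utt1", 3), 0.4)],
    [(("utt1", 3), 0.7), (("utt2", 5), 0.2)],
)
```

Taking the union can only add detections, which is consistent with the recall gain reported above, although it may also admit more false alarms. A real system would smooth the estimates and apply the model when scoring lattice paths.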

Keywords/Search Tags:

spoken term detection, news corpus, lattice, n-gram model, long-distance bigram model, automatic corpus splicing

Related items

1	Research On WFST Based Spoken Term Detection
2	Construction And Application Of The Northeastern Native Spoken Language Corpus
3	Study And Implementation Of Content-based Mandarin Spoken Term Detection System
4	Chinese New Word Identification Based On Large-scale Corpus
5	An Automatic Chinese Text Categorization System Based On Statistical Language Model
6	Deep Learning For Spoken Term Detection
7	Design And Implementation Of Automatic Construction System Of English-chinese Parallel Corpus
8	Research On Lattice Based Spoken Document Retrieval
9	Research On Named Entity Equivalents Automatic Acquisition Method Based On English-Chinese Parallel Corpus
10	Researches On Technologies Of Diglossia Parallel Corpus Selection Automation For Statistical Machine Translation