
Research On Syllable Lattice Based Chinese Spoken Document Retrieval Method

Posted on: 2009-02-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T R Zheng
Full Text: PDF
GTID: 1118360278962067
Subject: Computer application technology
Abstract/Summary:
With the development of information and multimedia technology, more and more speech data are available worldwide via the Internet. To meet the rapidly growing need to organize and analyze these data efficiently, content-based spoken document retrieval is a key technology. The task of spoken document retrieval (SDR) can be described as follows: given a user's queries, all files or segments containing relevant speech content are found in a large collection of multimedia documents and listed. In SDR, speech recognition is generally used to index the documents; however, its high error rate and the absence of out-of-vocabulary (OOV) words from the recognition results limit retrieval performance. Subword lattice based retrieval methods have therefore been investigated to avoid the OOV problem and to compensate for the performance loss caused by recognition errors. For Chinese, syllable lattice based retrieval is widely used by researchers.

A key problem of the syllable lattice based approach is that a lattice is difficult to index. Its directed graph structure and its mixture of correct and incorrect candidates not only result in very low retrieval accuracy for traditional retrieval methods but also require much more index space and search time. Retrieval methods that suit syllable lattices and balance retrieval accuracy, index size, and retrieval speed are therefore a valuable and important research topic.

This thesis first proposes three Chinese spoken document retrieval methods with different indexing and search techniques, each favoring a different aspect of performance. Then, because retrieval performance is also restricted by the lower bound on the lattice error rate, two accuracy improvement methods that reduce this lower bound are studied.
Specifically, this thesis is organized as follows:

1) A word spotting based retrieval method is proposed, in which the syllable lattice is stored directly as the index, the word spotting algorithm is split into an offline part and an online part to carry out retrieval, and word frequency and word confidence score are combined in the similarity measure. Although higher accuracy is achieved, even close to the retrieval accuracy obtained on the best alternatives in the lattice, index size and retrieval speed are not good enough for retrieval over large collections. A redundancy removal method is also proposed, which uses a syllable posterior probability histogram to distinguish useful information from redundant information and then removes the redundancy from the lattice indices. Experiments show that redundancy removal yields a smaller index and a shorter search time.

2) Syllable inverted index based retrieval methods are proposed, which reduce index size efficiently. To improve accuracy, two matching methods that relax the path constraint in the search stage are investigated: time based matching and position based matching. In position based matching, the syllable lattice is interpreted as a sequence of competition sets, and a position specific posterior probability is calculated for every candidate. A similarity weighting method based on the rank lists within the competition sets is also studied. Experiments show that both matching methods improve accuracy slightly; position based matching is the better of the two, and rank weighting improves accuracy further. A posterior probability based pruning method is also presented to speed up retrieval.

3) To build indices at the document level, a retrieval method based on a neighbor syllable posterior probability matrix is proposed, which improves index size and retrieval speed substantially and so meets the needs of SDR tasks on large-scale corpora. K-step neighbor syllable pairs are introduced to represent long-distance correlation, and a neighbor posterior probability matrix is adopted to represent the contents of the lattices. The posterior probability of neighbor syllable pairs in documents is calculated, and the resulting document-level neighbor syllable posterior probability matrix is taken as the document index. Experiments show that although accuracy falls by 5%, index size and retrieval speed are comparable to those of text retrieval. Prosody is then used to weight the similarity measure. Three prosodic weighting methods are investigated, of which energy based weighting gives the best result, improving accuracy by 2.7%.

4) The limit on accuracy improvement is explored, and two accuracy improvement methods that lower the lattice error rate bound are proposed: one based on an extended lattice, the other on a word fragment language model. The extended lattice approach lowers the lattice error rate by estimating the probabilities of syllables missed by the recognizer; the lattice error rate falls by 1.7% and accuracy improves by 4%. The word fragment approach lowers the lattice error rate by introducing a higher-level semantic unit into the speech recognizer.
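
As a rough illustration of the document-level index described in 3), the sketch below accumulates K-step neighbor syllable pair scores for one document. It is not the thesis's implementation, and several things are assumptions made only for this example. The lattice is taken to be already reduced to a sequence of competition sets, as in the position based matching of 2). The pair posterior is approximated by the product of the two position specific posteriors. A query is scored by summing the matching pair entries. The names `build_pair_index` and `score`, the parameter `max_step`, and the toy posterior values are all hypothetical.

```python
from collections import defaultdict


def build_pair_index(competition_sets, max_step=2):
    """Build a sparse document-level pair matrix.

    competition_sets: list of dicts, one per position, mapping a syllable
    to its position specific posterior probability (assumed input format).
    Returns a dict keyed by (left_syllable, right_syllable, step).
    """
    index = defaultdict(float)
    for i, left in enumerate(competition_sets):
        for step in range(1, max_step + 1):
            if i + step >= len(competition_sets):
                break
            right = competition_sets[i + step]
            for a, pa in left.items():
                for b, pb in right.items():
                    # Assumption: pair posterior ~ product of the two posteriors.
                    index[(a, b, step)] += pa * pb
    return dict(index)


def score(query_syllables, index, max_step=2):
    """Sum the index entries for every syllable pair in the query (assumed measure)."""
    total = 0.0
    for i, a in enumerate(query_syllables):
        for step in range(1, max_step + 1):
            if i + step < len(query_syllables):
                b = query_syllables[i + step]
                total += index.get((a, b, step), 0.0)
    return total


# Toy document: four positions with made-up posteriors.
doc = [
    {"bei": 0.9, "pei": 0.1},
    {"jing": 0.7, "jin": 0.3},
    {"da": 1.0},
    {"xue": 0.8, "xie": 0.2},
]
index = build_pair_index(doc, max_step=2)
print(score(["bei", "jing", "da"], index))
print(score(["shang", "hai"], index))
```

Because the index stores only pair statistics per document, it no longer depends on lattice paths, which is consistent with the abstract's claim that index size and retrieval speed approach those of text retrieval at some cost in accuracy.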
Keywords/Search Tags: Chinese spoken document retrieval, syllable lattice, syllable inverted index, neighbor syllable posterior probability matrix, lattice error rate lower bound