Chinese Spoken Document Retrieval Method Based On Stop-word Processing

Posted on:2009-05-26

Degree:Master

Type:Thesis

Country:China

Candidate:B Jiang

GTID:2178360278464767

Subject:Computer Science and Technology

Abstract/Summary:

With the development of the Internet and multimedia technology, the number of spoken documents has increased rapidly, and an effective retrieval method for spoken documents is becoming more and more important. As a new field of speech recognition, spoken document retrieval (SDR) aims to search a collection of spoken documents and return to the user the spoken documents, or segments of spoken documents, that are relevant to a query. Using indexes of the spoken documents built beforehand, it can search efficiently by content. This paper investigates strategies for improving the performance of Chinese SDR. Because stop-words occur frequently in spoken documents, this paper introduces stop-word processing into SDR. Stop-words are defined as words that appear frequently in documents but carry no useful information for retrieval. Introducing such non-content stop-words is bound to degrade SDR performance. Given the particular characteristics of SDR, this paper applies an entropy method to extract stop-words and designs a stop-word extraction algorithm. Compared with the word-frequency method, this method performs better and reflects context better. This paper provides a complete online spoken document retrieval process, which includes building an index based on the syllable lattice, computing the similarity between the query and each spoken document with the vector space model (VSM), ranking the results by similarity, and returning them to the user. Every spoken document is represented by a feature vector constructed from its syllable lattice: the acoustic scores of syllables and syllable pairs are extracted by searching each spoken document's syllable lattice, and these scores form the document's feature vector.
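The abstract does not give the entropy formula, so the following is only a minimal sketch of one common way to compute left-right entropy, not the thesis's algorithm. It assumes plain 1-best token sequences rather than syllable lattices, uses hypothetical boundary markers "<s>" and "</s>", and ranks candidates by the smaller of the two entropies; all function names are illustrative. The idea is that a stop-word appears next to many different neighbours, so both its left and right context entropies are high.

```python
import math
from collections import Counter, defaultdict


def _entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbour frequencies."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def left_right_entropy(sequences):
    """Return {token: (left_entropy, right_entropy)} over all sequences.

    Assumption: each sequence is a list of tokens (words or syllables);
    "<s>" and "</s>" are hypothetical boundary markers.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for i in range(1, len(padded) - 1):
            tok = padded[i]
            left[tok][padded[i - 1]] += 1   # neighbour on the left
            right[tok][padded[i + 1]] += 1  # neighbour on the right
    return {tok: (_entropy(left[tok]), _entropy(right[tok])) for tok in left}


def stop_word_candidates(sequences, top_k):
    """Rank tokens by min(left, right) entropy (an assumed scoring rule)."""
    scores = left_right_entropy(sequences)
    ranked = sorted(scores, key=lambda t: min(scores[t]), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    corpus = [
        ["wo", "de", "shu"],
        ["ni", "de", "bi"],
        ["ta", "de", "mao"],
        ["wo", "kan", "shu"],
    ]
    print(stop_word_candidates(corpus, 1))
```

In this toy corpus "de" is the only token with varied neighbours on both sides, so it is ranked first. A pure frequency count would not separate it from a frequent content word that always occurs in the same context.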
Because of the error rate of the automatic speech recognizer (ASR) and because a single syllable can correspond to multiple characters, the syllables of stop-words are weighted by a penalty value to reduce their weight in the feature vector; the value is set to 0.1 after comparing the retrieval results obtained with different values. Cosine similarity is used to estimate the relevance between the query and a document. In experiments, the improved system performs better than the baseline system. The main contributions of this paper are: (1) a stop-word extraction algorithm based on left-right entropy, which extracts stop-words properly from the syllable lattice; and (2) an improved VSM based on stop-word penalization, which improves the performance of the retrieval system.
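The penalized vector space model can be illustrated with a small sketch. It is a hedged illustration under stated assumptions, not the thesis's implementation. Feature vectors are taken to be dictionaries mapping a syllable (a string) or a syllable pair (a tuple) to an accumulated acoustic score. A feature is penalized if any of its syllables belongs to the stop-word syllable set, which is an assumption. The function names are hypothetical. Only the penalty value 0.1 and the use of cosine similarity come from the abstract.

```python
import math

STOP_PENALTY = 0.1  # penalty value reported in the abstract


def penalize(vector, stop_syllables, penalty=STOP_PENALTY):
    """Scale down features that involve a stop-word syllable.

    Assumption: keys are syllables (str) or syllable pairs (tuple of str);
    a pair is penalized if either of its syllables is a stop-word syllable.
    """
    out = {}
    for feature, weight in vector.items():
        parts = feature if isinstance(feature, tuple) else (feature,)
        if any(p in stop_syllables for p in parts):
            weight *= penalty
        out[feature] = weight
    return out


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs, stop_syllables):
    """Return (doc_id, similarity) pairs sorted by decreasing similarity."""
    q = penalize(query, stop_syllables)
    scored = [
        (doc_id, cosine(q, penalize(vec, stop_syllables)))
        for doc_id, vec in docs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    query = {"de": 1.0, "shu": 1.0}
    docs = {
        "a": {"de": 5.0, "mao": 1.0},   # matches only the stop-word syllable
        "b": {"shu": 1.0, "bi": 1.0},   # matches the content syllable
    }
    print(rank(query, docs, stop_syllables=set()))
    print(rank(query, docs, stop_syllables={"de"}))
```

Without the penalty, document "a" ranks first only because it contains the stop-word syllable many times. With the penalty applied to both the query and the documents, document "b", which shares the content syllable, moves to the top.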

Keywords/Search Tags:

Chinese SDR, stop-word, entropy, syllable lattice, VSM

Related items

1	Research On Syllable Lattice Based Chinese Spoken Document Retrieval Method
2	Research On Chinese Syllable Evaluation Approach After Automatic Speech Recognition
3	The Research About Keyword Spotting Of Garbage Model Based Syllable Lattice
4	Syllable-based Method Of Tone Recognition For Chinese Continuous Speech
5	Research On Chinese Speech Recognition Algorithm Based On Syllable Modeling
6	Research And Application Of The Enhancement Method For Initial Of Chinese Syllable
7	Key Issues Of Spoken Document Retrieval Based On Syllable-Fragment Lattice
8	Research On Mandarin Spoken Document Retrieval Based On Lattice
9	Research On Key Technologies For E-Commerce Oriented Web Usage Mining
10	Research And Implementation Of Keyword Spotting System With Large Keyword Table In Spontaneous Speech