
Research On Web-based Mandarin News Retrieval

Posted on: 2015-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2308330473450327
Subject: Signal and Information Processing

Abstract/Summary:
In the era of big data, retrieving content of interest quickly and accurately from a large-scale audio database is a problem that urgently needs solving. Conventional text-based audio retrieval cannot solve it, so content-based audio information retrieval has drawn increasing attention.

This study focuses on audio data from Mandarin broadcast and TV news and explores two approaches within audio information retrieval: example-based Mandarin news retrieval and LVCSR-based Mandarin news retrieval. Both aim to search a news audio database automatically for a user-provided query and return the relevant audio segments. In the former the query is an audio excerpt, while in the latter it is a keyword or phrase; the two correspond to audio query by example and spoken document retrieval, respectively. The main innovations and contributions of this study are as follows:

1. Example-based Mandarin news retrieval algorithm

In audio query by example, extracting high-dimensional feature vectors is complex and computing similarity is costly. To address these problems, this study explores and validates a spectrogram-based audio fingerprint extraction algorithm. Building on the audio fingerprint feature, the thesis proposes a retrieval algorithm based on an inverted index, which overcomes the shortcomings of the forward index method and significantly improves retrieval speed. The study also offers an improved retrieval algorithm based on speech activity detection, which effectively avoids the loss of retrieval speed caused by long input audio examples.

2. LVCSR-based Mandarin news retrieval algorithm

First, this study proposes a text-independent story segmentation algorithm for broadcast and TV news. Unlike most existing algorithms, which rely on speech recognition transcripts, it detects story boundaries directly from the audio stream and achieves good results. Second, because a story segment must be processed before it is used as input to the LVCSR system, the thesis studies front-end processing algorithms for the LVCSR system. To make full use of the speaker information provided by speaker segmentation and clustering, and to improve the accuracy of audio segmentation, a two-stage news audio segmentation algorithm is adopted. On the basis of this audio pre-processing, a large vocabulary continuous speech recognition (LVCSR) system for Mandarin news audio is established. Last, for retrieval, the thesis implements a Lucene-based full-text retrieval algorithm and proposes a word-vector-based algorithm for recommending relevant keywords or phrases. With this recommendation algorithm, the LVCSR-based Mandarin news retrieval system can return words or phrases related to the query terms.

3. Development of the Mandarin news retrieval system

Based on the studies and improvements above, we designed and implemented a Mandarin news retrieval system with a web interface. The system retrieves efficiently in both the example-based and the LVCSR-based modes, provides user-friendly interaction, and can be applied in practice.
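The abstract does not give the fingerprint format or the index layout, so the following is only a minimal sketch of how an inverted index over audio fingerprints might speed up query by example. It assumes each audio frame has already been reduced to one integer hash, and it ranks candidates by offset voting, a common technique that is not confirmed to be the thesis's method. The names `build_inverted_index`, `search`, and the toy clips are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(database):
    """Map each fingerprint hash to the (clip_id, frame) positions where it occurs.

    `database` is assumed to be {clip_id: [hash_at_frame_0, hash_at_frame_1, ...]}.
    """
    index = defaultdict(list)
    for clip_id, hashes in database.items():
        for frame, h in enumerate(hashes):
            index[h].append((clip_id, frame))
    return index


def search(index, query_hashes):
    """Return (clip_id, start_frame, votes) for the best-aligned clip, or None.

    Each matching hash votes for the alignment (clip, db_frame - query_frame);
    a true match piles many votes onto a single alignment.
    """
    votes = Counter()
    for q_frame, h in enumerate(query_hashes):
        # Only postings for this hash are visited, not the whole database.
        for clip_id, db_frame in index.get(h, ()):
            votes[(clip_id, db_frame - q_frame)] += 1
    if not votes:
        return None
    (clip_id, start), count = votes.most_common(1)[0]
    return clip_id, start, count


# Toy fingerprints (hypothetical values, not real audio data).
database = {
    "news_a": [11, 42, 7, 93, 58, 21, 64],
    "news_b": [5, 17, 42, 88, 30, 76],
}
index = build_inverted_index(database)
result = search(index, [93, 58, 21])
print(result)  # ('news_a', 3, 3)
```

In this sketch a forward index would compare the query against every stored clip, whereas the inverted index touches only the postings of hashes that actually occur in the query, which is the kind of saving the abstract attributes to the inverted-index method.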
Keywords/Search Tags: broadcast and TV news, audio query by example, inverted index, spoken document retrieval, story segmentation