Research On QbE-based Keyword Spotting Technology And System Implementation

Posted on: 2022-06-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Zhan
GTID: 2518306569472804
Subject: Signal and Information Processing
Abstract/Summary:
The volume of speech data has exploded, but because transcription is expensive, its utilization rate remains low. Obtaining the parts that users are interested in quickly, accurately, and cost-effectively is therefore the key to improving the utilization of speech data. Query-by-example (QbE) keyword spotting has attracted renewed attention because it requires no prior knowledge and is flexible for users; it is especially advantageous for low-resource languages. This thesis studies retrieval efficiency, detection accuracy, the reasonableness of the matching algorithm, and the use of multiple examples in QbE-based keyword spotting. The effectiveness of the methods is evaluated mainly on a Mandarin test set, and their robustness is then examined on Hakka and English test sets. Finally, a QbE-based keyword spotting system is implemented. The work of this thesis is as follows:

1. A method is proposed that uses the skeleton structure of the query to select keyword candidates in the audio being searched. Based on the unvoiced/voiced skeleton structure, a fuzzy search strategy selects the regions of the searched audio whose skeleton structures are similar to that of the query as candidates for matching. 150 examples are used for retrieval on the Mandarin test set, which contains 2162 audio files with a total duration of 2.78 h. With an AUC close to that of S-DTW, the skeleton-structure selection method reduces the average real-time detection rate to only 30% of that of S-DTW.

2. A method is proposed that improves the DTW algorithm using unvoiced/voiced information. DTW is constrained by the speech frame type (unvoiced/voiced) and by the temporal position of the voiced portions. These constraints yield a more reasonable selection of matching pairs by discarding extremely unreasonable mismatch regions and by increasing the local distance, which guides the DTW algorithm toward a matching path with a more reasonable physical meaning.

3. A method is proposed for calculating the similarity of voiced portions based on the trend of the fundamental frequency. The estimated gradient of "four points and three segments" is used to extract the fundamental frequency trend. The similarity score between the fundamental frequency trends of the query and the candidates is then calculated and merged with the DTW distance score, which supplements the detection score with tone information. Compared with evaluation by the DTW distance score alone on the Mandarin test set, the AUC, P@10 and MAP increase by 2.7%, 3% and 5.2% respectively.

4. A keyword spotting method is presented that combines candidate selection based on the unvoiced/voiced skeleton structure, the improved DTW algorithm, and fusion of the voiced-portion similarity score. According to the experimental results, compared with the baseline S-DTW, the AUC and MAP increase by 7.9% and 12.4% respectively on the Mandarin test set, by 4.3% and 10% on the self-made Hakka dialect test set, and by 4.4% and 5.1% on the TIMIT test set. Meanwhile, the average real-time retrieval rate is about 38% of that of S-DTW in all three cases.

5. A fusion method for multiple examples is improved. The target template is selected based on the idea of the center of the example distribution, and examples that differ greatly from it are eliminated according to the DTW distance between the target template and the remaining examples. The remaining examples are then aligned to the target template. Experiments show that the AUC on the Mandarin test set is 0.88, which is better than that of other fusion methods.

6. A QbE-based keyword spotting system is implemented that lets users select pre-stored examples or record new examples to search for related files in the audio database. The search results are returned to users in order of relevance from high to low, and users can play or save the files.
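
To make the matching step in items 1, 2 and 4 more concrete, the following is a minimal sketch and not the thesis implementation. The abstract does not expand "S-DTW"; the sketch assumes it means subsequence DTW, in which the query may start and end at any frame of the searched audio. The function name `subsequence_dtw`, the Euclidean local distance, the normalization by query length, and the `penalty` parameter are all illustrative assumptions. The penalty only illustrates the idea in item 2 of increasing the local distance when the unvoiced/voiced frame types disagree; the thesis's actual constraints, including the timing constraint on voiced portions, are not specified in the abstract.

```python
import numpy as np


def subsequence_dtw(query, audio, q_voiced=None, a_voiced=None, penalty=0.0):
    """Find the best match of `query` inside `audio`.

    query, audio: arrays of shape (frames, dims), e.g. per-frame features.
    q_voiced, a_voiced: optional boolean arrays marking voiced frames.
    penalty: hypothetical cost added when the frame types disagree.
    Returns (cost normalized by query length, start frame, end frame).
    """
    query = np.asarray(query, dtype=float)
    audio = np.asarray(audio, dtype=float)
    n, m = len(query), len(audio)

    # Local distance: Euclidean distance for every query/audio frame pair.
    local = np.linalg.norm(query[:, None, :] - audio[None, :, :], axis=2)

    # Assumed form of the unvoiced/voiced constraint: penalize type mismatches.
    if q_voiced is not None and a_voiced is not None:
        qv = np.asarray(q_voiced, dtype=bool)[:, None]
        av = np.asarray(a_voiced, dtype=bool)[None, :]
        local = local + penalty * (qv != av)

    acc = np.full((n, m), np.inf)        # accumulated cost
    start = np.zeros((n, m), dtype=int)  # audio frame where each path began
    acc[0, :] = local[0, :]              # a match may start at any audio frame
    start[0, :] = np.arange(m)

    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + local[i, 0]
        start[i, 0] = start[i - 1, 0]
        for j in range(1, m):
            candidates = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            steps = ((i - 1, j - 1), (i - 1, j), (i, j - 1))
            k = int(np.argmin(candidates))
            acc[i, j] = candidates[k] + local[i, j]
            start[i, j] = start[steps[k]]

    end = int(np.argmin(acc[-1]))        # a match may end at any audio frame
    return float(acc[-1, end] / n), int(start[-1, end]), end
```

For example, calling `subsequence_dtw([[0], [1], [2]], [[5], [5], [0], [1], [2], [5]])` locates the query at audio frames 2 to 4. Restricting this search to pre-selected candidate regions, as item 1 describes, would reduce the number of cells that need to be filled.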
Keywords/Search Tags:Query-by-example spoken term detection, Unvoiced/voiced skeleton structure, Fundamental frequency trend, DTW, Template fusion
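
The template selection in item 5 of the abstract can be sketched as follows. This is a hedged illustration under stated assumptions, not the thesis's method: the "center of the example distribution" is approximated here by the medoid (the example with the smallest total DTW distance to all others), and the pruning rule (`max_ratio` times the median distance to the template) is a hypothetical stand-in for the unspecified elimination criterion. The names `dtw_distance` and `select_template` are illustrative, and the subsequent alignment and fusion of the kept examples is not shown.

```python
import numpy as np


def dtw_distance(a, b):
    """Full-sequence DTW cost between two (frames, dims) arrays.

    The cost is divided by the summed lengths so that long and short
    examples are comparable (an assumed normalization).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev
    return float(acc[n, m] / (n + m))


def select_template(examples, max_ratio=2.0):
    """Pick a central example and drop examples that are far from it.

    Returns (index of the template, indices of the examples kept).
    """
    k = len(examples)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(examples[i], examples[j])

    # Medoid: the example with the smallest total distance to the others.
    template = int(np.argmin(dist.sum(axis=1)))
    others = [i for i in range(k) if i != template]
    if not others:
        return template, []

    # Hypothetical pruning rule: keep examples within max_ratio x median.
    cutoff = max_ratio * np.median(dist[template, others])
    kept = [i for i in others if dist[template, i] <= cutoff]
    return template, kept
```

The kept examples would then be aligned to the template, for instance along their DTW paths, before being fused into a single query.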