
Research On Acoustic Feature Analysis In Audio Retrieval

Posted on: 2016-02-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Y Zhang
Full Text: PDF
GTID: 1108330479495101
Subject: Information and Communication Engineering
Abstract/Summary:
With the development of Internet technology and the popularization of handheld filming and recording devices, the multimedia data easily accessible to ordinary users is growing explosively. To manage and retrieve such data, content-based retrieval has become an active research topic. The user uploads a query clip that represents the search intention; the system extracts representative low-level features from the clip and searches the database for similar samples. This approach relies on discriminative features and efficient search algorithms. This thesis focuses on the audio track of multimedia data and targets audio-content-based retrieval. Its main work is a set of low-level acoustic features based on the perception process in the human brain and on sparse representation; using these features, an inverted-index-based audio content retrieval algorithm is proposed. The main contributions of this thesis are as follows:

(1) Based on how the human brain perceives harmonic signals, a harmonic-component-based spectrum decomposition method is proposed. A dictionary is designed to describe the harmonic components of a spectrum, using parameters including fundamental frequency, formants and overtone energy decay rate to characterize the harmonic structure. With this dictionary and the matching pursuit algorithm, the signal is decomposed into a sparse representation, and statistics of the parameters of the decomposed atoms are used as features. Experiments show that the proposed feature reaches a recognition accuracy of 64.8% in 16-class closed-set audio effects classification, improving on MFCC and spectral features by 7.4% and 3.9% respectively; combined with MFCC, accuracy reaches 66.3%.

(2) The spectral-domain features in (1) suffer from low time resolution.
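The dictionary-plus-matching-pursuit decomposition in contribution (1) can be illustrated with a minimal sketch. The thesis's exact dictionary parameterization is not given in the abstract, so the harmonic atoms below (three hypothetical fundamentals with a fixed overtone decay of 0.5 per harmonic) are purely illustrative:

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms=10):
    """Greedy matching pursuit: approximate `signal` as a sparse sum of
    dictionary atoms (columns of `dictionary`, assumed unit-norm)."""
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        # Pick the atom most correlated with the current residual.
        corr = dictionary.T @ residual
        k = np.argmax(np.abs(corr))
        coeffs[k] += corr[k]
        residual -= corr[k] * dictionary[:, k]
    return coeffs, residual

# Toy harmonic dictionary: each atom is a fundamental plus overtones
# whose energy decays geometrically (rate 0.5 is an assumption).
n = 256
t = np.arange(n) / n
atoms = []
for f0 in (5, 8, 13):  # hypothetical fundamental frequencies (bins)
    atom = sum(0.5**h * np.sin(2 * np.pi * f0 * (h + 1) * t) for h in range(4))
    atoms.append(atom / np.linalg.norm(atom))
D = np.stack(atoms, axis=1)

# A noisy signal built from the second atom; MP recovers its index.
x = 2.0 * D[:, 1] + 0.1 * np.random.default_rng(0).standard_normal(n)
coeffs, residual = matching_pursuit(x, D, n_atoms=3)
print(np.argmax(np.abs(coeffs)))  # the f0 = 8 atom dominates
```

In the thesis, statistics of the selected atoms' parameters (fundamental frequency, decay rate, etc.), rather than the raw coefficients, serve as the classification feature.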
A temporal decomposition model based on human-brain perception is therefore proposed. Following the perception process of harmonic, transient and residual components in the human brain, the model decomposes the signal into these three sub-spaces, and the components in each sub-space are represented by joint time-frequency features: the harmonic sub-space by a Gabor dictionary with fine time-frequency resolution, the transient sub-space by a Gammatone dictionary consistent with the response of the human ear, and the residual sub-space by its noise color. Experiments show that the proposed feature reaches a recognition accuracy of 72.3% in 16-class audio effects classification, outperforming the MFCC, MFCC+MP and MFCC+MAXMP features by 14.9%, 6.2% and 4.7% respectively.

(3) The conventional coefficient-vector-based sparse representations in (1) and (2) fail to characterize atom parameters. A sparse representation based on a high-dimensional coefficient tensor is therefore proposed. The tensor uses its different dimensions (modes) to represent atom parameters, so that it jointly describes the time, frequency and duration of the Gabor components in the signal, yielding a joint time-frequency-duration representation. Furthermore, a non-negative sparse tensor decomposition algorithm is proposed that exploits the sparseness of the tensor as a penalty term to avoid over-fitting. The time, frequency and duration factors decomposed from the tensor are used as features, achieving 82.2% classification accuracy in 16-class closed-set audio effects classification and a 20.4% EER in open-set verification.

(4) Traditional audio retrieval by sequential search carries a heavy computational burden. An inverted-index-based audio content retrieval method is proposed, covering audio content segmentation, semi-supervised dictionary training and similarity measurement.
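The coefficient tensor of contribution (3) can be sketched as follows. Each Gabor atom selected by the pursuit carries (time, frequency, duration) indices, and its coefficient is accumulated into a three-mode tensor. The atom list and grid sizes below are hypothetical, and the marginal sums are crude stand-ins for the factors that the thesis's non-negative sparse tensor decomposition would produce:

```python
import numpy as np

# Hypothetical atoms selected by a Gabor matching pursuit:
# (time_bin, freq_bin, duration_bin, coefficient).
atoms = [(2, 5, 1, 0.9), (2, 5, 2, 0.4), (7, 1, 0, 1.2), (7, 1, 1, 0.3)]

T, F, S = 10, 8, 4  # illustrative grid sizes for the three modes
tensor = np.zeros((T, F, S))
for t, f, s, c in atoms:
    tensor[t, f, s] += abs(c)  # non-negative coefficient tensor

# Marginal mode profiles: simple proxies for the time, frequency and
# duration factors of a non-negative CP decomposition, concatenated
# into one fixed-size feature vector.
time_factor = tensor.sum(axis=(1, 2))
freq_factor = tensor.sum(axis=(0, 2))
dur_factor = tensor.sum(axis=(0, 1))
feature = np.concatenate([time_factor, freq_factor, dur_factor])
print(feature.shape)  # (10 + 8 + 4,) = (22,)
```

The point of the tensor form is that, unlike a flat coefficient vector, each mode retains one atom parameter, so the decomposed factors separately summarize when, at what frequency, and for how long energy occurs.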
First, a noise-robust and fast speaker change detection algorithm based on non-adjacent data windows is proposed. A multi-layer audio content segmentation algorithm is then built on this speaker change detection algorithm, and a semi-supervised dictionary training method converts the segmented audio clips into audio words. An inverted index is then constructed following text retrieval practice. In the search phase, candidate segments are ranked using both content and sequence matching between the query clip and each candidate. With a query duration of 20 seconds, the proposed method achieves a precision of 95.68%, outperforming the sequential search methods TAS and MOTS by 2.82% and 1.37% and the bag-of-audio-words method by 18.77%, while its computation time is only 66.26%, 35.50% and 75.93% of theirs, respectively.
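The inverted-index retrieval step can be sketched in a few lines. The audio-word sequences below are hypothetical stand-ins for the output of the trained dictionary, and only the content-matching score is shown; the thesis additionally ranks candidates by word-sequence order, for which the stored positions would be used:

```python
from collections import defaultdict

def build_inverted_index(segments):
    """segments: {segment_id: [audio_word, ...]}
    Returns {audio_word: [(segment_id, position), ...]}."""
    index = defaultdict(list)
    for seg_id, words in segments.items():
        for pos, w in enumerate(words):
            index[w].append((seg_id, pos))
    return index

def search(index, query_words):
    """Score candidates by the number of distinct shared audio words
    (content matching only; sequence matching is omitted here)."""
    scores = defaultdict(int)
    for w in set(query_words):
        for seg_id in {s for s, _ in index.get(w, [])}:
            scores[seg_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical audio-word sequences for three database segments.
db = {"clip_a": [3, 7, 7, 2], "clip_b": [9, 4, 3], "clip_c": [1, 1, 5]}
results = search(build_inverted_index(db), [3, 7])
print(results)  # [('clip_a', 2), ('clip_b', 1)]
```

Because only segments sharing at least one audio word with the query are ever touched, the cost scales with the posting lists rather than the whole database, which is the source of the speed-up over sequential search reported above.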
Keywords/Search Tags: audio signal processing, content-based audio retrieval, audio effects classification, acoustic features, sparse representation, audio segmentation