
Research On Audio Scene Recognition Based On Anchor Space

Posted on: 2012-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: J Yang
Full Text: PDF
GTID: 2218330362450464
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of modern information technology, especially network multimedia and digital signal processing, more and more audio signals are digitized and stored in a variety of audio formats. As a result, there is an urgent need for effective methods that recognize the content of audio data streams, so that audio resources can be used efficiently and can supply suggestions to intelligent systems.

An audio scene is an audio fragment composed of several acoustic events that are semantically related and adjacent in time. Such a fragment typically carries a high-level abstract concept and a specific semantic meaning. Audio scene recognition aims to recognize and understand the semantic content of audio at this high level. The technology is widely used in information content security, smart surveillance, unmanned vehicles, smart meeting rooms, and similar domains.

Traditional audio scene recognition methods, such as those based on the Gaussian Mixture Model (GMM), model and recognize audio over short time windows and then produce a final long-term decision from the short-term scores. This approach neglects the long-term distribution of the acoustic content, and it fails in situations where target and non-target acoustic events are mixed together. This thesis proposes three audio scene recognition methods that model audio over long time spans based on anchor space. It also designs a recognition task, finding excited audio scene fragments in entertainment programs, to test the performance of the three methods. An excited audio scene fragment is one that contains intense applause and laughter acoustic events.

An anchor can be seen as a prototype of a category, and mapping an input signal onto the categories yields a vector in the anchor space. This thesis proposes three methods for constructing the anchor space and designs a corresponding audio scene recognition method for each.

First, the anchor space is built from the statistics of state changes. This method converts the magnitude of change of the audio features over time into change states, and the anchor space is obtained from the statistics of those states. The projection of each target audio file forms one anchor vector, which can be seen as one model of the target scene, and all anchor vectors together form the target scene library.

Second, the anchor space is built from the Gaussian Mixture Model. The target audio in the training data trains a target GMM, and the non-target audio trains a non-target GMM. The anchor space is obtained from the mean vectors of the two GMMs. Each audio frame is projected to a point in this anchor space using cosine distance, and the mean of the points generated by all target audio frames serves as the target anchor model.

Third, the anchor space is built from sparse decomposition. A target dictionary is learned from the target audio in the training data, and a non-target dictionary is learned from the non-target audio. The anchor space is obtained from the atoms of the two dictionaries, and the coefficients of the sparse decomposition are the coordinates in this anchor space.

The experimental data are entertainment programs downloaded from the Internet. The experimental results show that all three methods recognize the excited scenes in the programs well. The method based on the statistics of state changes performs particularly well: at a recall of 85.67%, the false alarm rate is only 9.57%. A systematic review of the results indicates that there is still considerable room for improvement.
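
The GMM-based projection described in the abstract can be illustrated with a minimal sketch. It assumes the anchors are already available as mean vectors (the values below are hypothetical, not taken from the thesis), and it uses cosine similarity as the coordinate along each anchor, which is one plausible reading of "cosine distance". The function names are illustrative only, and GMM training is not shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def project_frame(frame, anchors):
    """Map one feature frame to a point in anchor space: one coordinate per anchor."""
    return [cosine_similarity(frame, anchor) for anchor in anchors]

def anchor_model(frames, anchors):
    """Average the anchor-space points of all frames to get a single model vector."""
    points = [project_frame(frame, anchors) for frame in frames]
    count = len(points)
    return [sum(p[i] for p in points) / count for i in range(len(anchors))]

# Hypothetical anchors standing in for the mean vectors of the target
# and non-target GMMs (assumed values, for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical feature frames from target audio.
target_frames = [[2.0, 0.0], [0.0, 3.0]]

model = anchor_model(target_frames, anchors)
```

Under these assumptions, a test segment could be projected the same way and compared against `model`. How the thesis scores that comparison is not stated in the abstract.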
Keywords/Search Tags: Audio Scene Recognition, Anchor Space, Gaussian Mixture Model, Sparse Decomposition