Effective Feature Extraction On Sound Event Recognition

Posted on: 2017-01-03    Degree: Master    Type: Thesis
Country: China    Candidate: Z P Xie    Full Text: PDF
GTID: 2308330485953724    Subject: Information and Communication Engineering

Abstract/Summary:
Sound Event Recognition (SER) is an emerging topic within sound recognition and has attracted much interest from researchers due to its promising applications. A wide range of sounds, such as door knocking, claps, footsteps, and even bird chirping, can be recognized through SER technology. This sound information helps machines, as well as humans, perceive the surrounding environment and detect changes in it. SER systems open up a wide range of applications, including security monitoring, acoustic surveillance, and smart machines, and also offer the possibility of a better human-robot interaction experience.

On one hand, pioneers of SER have explored many sound event recognition systems with a variety of feature extraction methods over the last decades, yielding good breakthroughs and performance. On the other hand, with the development of artificial intelligence, deep learning methods such as neural networks are known for their powerful feature extraction and modeling abilities, and they have brought great improvements to pattern recognition, automatic control, and smart robots, especially in speech recognition and image understanding. Recent SER systems can achieve good performance under low-noise conditions, but they suffer severe performance degradation in the presence of high-power noise.

To solve this problem, this thesis focuses on methods for extracting robust features of sound events with powerful deep neural network structures, in order to further improve the performance of SER systems, especially under low signal-to-noise ratio (SNR) conditions. The proposed methods cover the following three aspects.

Firstly, we propose a non-linear mapping method for extracting an effective spectrogram image feature (SIF) in both the time and frequency domains.
Traditional spectrogram features contain time and frequency information simultaneously; to obtain discriminative features from them, we propose a data-driven down-sampling method. In the frequency domain, we decide the sampling boundaries by first analyzing the differences in frequency distribution between sound events and noise, where the distribution information is revealed by the variances of the individual frequency bins. In the time domain, we use Fibonacci sequences to down-sample frame windows of different lengths. After this non-linear mapping, the new features are fed into deep neural networks for further abstract feature extraction. The experimental results show that SIF with this non-linear frequency-domain mapping improves recognition accuracy, especially under noisy conditions.

Secondly, we propose an effective feature extraction method that merges two different features using neural networks. After introducing the cochleagram image feature (CIF) to the SER field, we merge it with SIF using deep neural networks (DNN) and convolutional neural networks (CNN) to obtain better features. Two alternative merging methods arise, called "two-channel feature merge" and "raw-image feature (RIF) merge". The first approach exploits the property of CNNs that more than one image can be fed into the network simultaneously through different channels: we place SIF in one input channel and CIF in another, then merge the two features after the first convolution stage. Instead of putting the two image features into different channels, RIF simply concatenates them before sending them into the neural network. The experimental results show that the merged feature outperforms any single feature, and that RIF with a CNN performs best.
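The difference between the two merging strategies above can be sketched with array operations alone. This is a minimal illustration, not the thesis's implementation: the 64x32 feature size and the channels-first layout are assumptions made here for concreteness.

```python
import numpy as np

# Hypothetical feature-map size; the thesis does not specify dimensions.
N_FREQ, N_FRAMES = 64, 32

def two_channel_merge(sif, cif):
    """Two-channel merge: stack SIF and CIF as separate input channels
    of one CNN input (channels-first layout: channel x freq x time)."""
    return np.stack([sif, cif], axis=0)        # shape (2, N_FREQ, N_FRAMES)

def raw_image_merge(sif, cif):
    """RIF merge: concatenate the two images along the frequency axis
    into one larger single-channel image."""
    return np.concatenate([sif, cif], axis=0)  # shape (2*N_FREQ, N_FRAMES)

sif = np.random.rand(N_FREQ, N_FRAMES)
cif = np.random.rand(N_FREQ, N_FRAMES)
print(two_channel_merge(sif, cif).shape)  # (2, 64, 32)
print(raw_image_merge(sif, cif).shape)    # (128, 32)
```

In the two-channel case the network sees the features as aligned planes and can learn cross-feature filters in its first convolution; in the RIF case the network treats the concatenation as one ordinary image.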
Inspired by the evidence that image features of different resolutions emphasize complementary information, we explored merging these time-frequency image features with the two methods mentioned above. The motivation is to incorporate both local and global information through multi-resolution extraction: local information is produced with a small frame length and small smoothing windows, while global information is produced with a larger frame length and larger windows. Merging them is therefore better than using either separately, and the experimental results confirm this hypothesis. These methods further improve recognition accuracy and robustness.
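The multi-resolution idea above can be sketched by computing the same signal's spectrogram at two frame lengths. The 16 kHz rate and the 256/1024-sample frame lengths are illustrative choices made here, not the thesis's actual settings.

```python
import numpy as np

def spectrogram(signal, frame_len, hop):
    """Magnitude spectrogram via a simple framed FFT (Hann window).
    Returns an array of shape (freq_bins, n_frames)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)  # 1 s of test noise at an assumed 16 kHz

# Short frames: fine time resolution ("local" information).
local_sif = spectrogram(x, frame_len=256, hop=128)
# Long frames: fine frequency resolution ("global" information).
global_sif = spectrogram(x, frame_len=1024, hop=512)
print(local_sif.shape, global_sif.shape)
```

The two resulting images have different shapes, so before a two-channel merge they would need to be resized or down-sampled to a common size; RIF-style concatenation along the time axis only requires matching the frequency dimension.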
Keywords/Search Tags: Sound Event Recognition, Feature Representation, Feature Merge, Deep Neural Network, Convolutional Neural Network, Spectrogram Image Feature, Cochleagram Image Feature