
On Stacked and Deep Neural Networks with Application to Speech Separation

Posted on: 2015-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
GTID: 2268330428482870
Subject: Computer Science and Technology

Abstract/Summary:
Speech is the most important medium of human communication, and most speech occurs in noisy environments. Listeners with normal hearing can understand speech largely unaffected by noise, but hearing-impaired listeners and voice recognition systems have difficulty with speech that is mixed with noise: handling such corrupted speech requires speech separation. Speech separation is the process of removing the noise from speech, that is, separating the target speech from the background noise. The theory of computational auditory scene analysis (CASA) analyzes how humans perform speech separation; it also studies how to represent the speech signal and proposes a computational goal for speech separation. Performing the speech separation task within the CASA framework is therefore a promising research area.

At present, following CASA, researchers formulate speech separation as binary classification: each separation unit (time-frequency unit, or T-F unit) is judged to belong either to the noise class or to the target-speech class. Recent work uses complex features and classifies the separation units one by one. Extracting those complex features is time-consuming, and together with processing only one unit at a time it makes the time complexity of the whole process very high, which greatly limits the application of CASA methods. For example, such methods are hard to apply in resource-constrained, real-time equipment such as hearing aids.

To overcome these two shortcomings of recent methods, namely the high time complexity caused by "using complex features" and by "dealing with one unit at a time", we, on the one hand, use simple features, which simplifies feature extraction and reduces the amount of computation; on the other hand, the proposed method produces results in batches. In this way we speed up the whole speech separation process.
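The binary-classification formulation above can be sketched as follows. This is a minimal illustration, not the thesis's actual model: the feature dimensions, the random placeholder weights, and the use of a simple logistic classifier are all assumptions made here only to show how a whole batch of T-F units can be labeled in one matrix operation rather than one unit at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

n_units = 6   # number of T-F units in one batch (hypothetical)
n_feats = 4   # dimension of the simple per-unit feature vector (hypothetical)

# Simple features for a whole batch of T-F units (one row per unit).
X = rng.normal(size=(n_units, n_feats))

# A trained classifier would supply these weights; random placeholders here.
w = rng.normal(size=n_feats)
b = 0.0

def classify_units(X, w, b):
    """Label every T-F unit in the batch at once:
    1 = target speech, 0 = noise.
    A single matrix product replaces the per-unit loop that the
    abstract identifies as a speed bottleneck."""
    scores = X @ w + b
    probs = 1.0 / (1.0 + np.exp(-scores))
    return (probs > 0.5).astype(int)   # binary mask over the batch

mask = classify_units(X, w, b)
print(mask.shape)  # one 0/1 label per T-F unit
```

The resulting 0/1 mask plays the role of the estimated binary mask in the CASA formulation: units labeled 1 are kept as target speech, units labeled 0 are discarded as noise.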
In addition, to further improve the classification accuracy of the speech separation system, we use a stacked neural network, which is composed of multiple basic networks and can model complex functional relations. The stacked structure is formed by stacking basic networks one on top of another: because the input of an upper network includes the output of the lower network, the upper network can build on the lower network's work, and as the number of network layers increases, accuracy gradually improves. Similarly, a deep neural network, which is a neural network with more hidden layers, can also model complex functional relations. However, the stacked neural network offers greater flexibility, allowing us to use guidance information to influence the training process, and it is this guidance information that further improves the performance of the speech separation system. Therefore, this thesis uses a stacked neural network instead of a deep neural network.

We compared the proposed approach with the best known methods, which are based on deep neural networks and support vector machines, on the same experimental data set. Our method is slightly better in accuracy and substantially faster, which allows the whole speech separation process to run in real time.
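The stacking idea described above, where the upper network's input includes the lower network's output, can be sketched as follows. The layer sizes, the random untrained weights, and the two-network depth are illustrative assumptions; the point is only the wiring: the upper network sees the raw features concatenated with the lower network's prediction, so it refines that prediction rather than starting from scratch.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, W1, b1, W2, b2):
    """One basic network: a single hidden layer with a sigmoid output."""
    h = np.tanh(x @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

n_feats, n_hidden = 4, 8
x = rng.normal(size=(5, n_feats))   # 5 T-F units with simple features

# Lower (basic) network: features -> per-unit speech probability.
W1a = rng.normal(size=(n_feats, n_hidden)); b1a = np.zeros(n_hidden)
W2a = rng.normal(size=(n_hidden, 1));       b2a = np.zeros(1)
lower_out = mlp(x, W1a, b1a, W2a, b2a)

# Upper network: its input is [features, lower network's output],
# so its decision is made on the basis of the lower network's work.
x_stacked = np.concatenate([x, lower_out], axis=1)
W1b = rng.normal(size=(n_feats + 1, n_hidden)); b1b = np.zeros(n_hidden)
W2b = rng.normal(size=(n_hidden, 1));           b2b = np.zeros(1)
upper_out = mlp(x_stacked, W1b, b1b, W2b, b2b)

print(upper_out.shape)  # one refined probability per T-F unit
```

In training, this wiring is also where guidance information could enter: extra signals can be concatenated into the upper network's input in the same way as `lower_out`, which is the flexibility the abstract attributes to the stacked structure over a monolithic deep network.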
Keywords/Search Tags:Stacked neural networks, Deep neural networks, Speech segregation, Computational auditory scene analysis (CASA)