At a cocktail party, listeners can selectively attend to a single voice and filter out other acoustic interference. Simulating this perceptual ability remains a major challenge. This paper presents a new approach to segregating overlapping speech based on sound-localization cues. We first divide the speech stream into time-frequency regions and calculate the interaural time difference (ITD) and interaural intensity difference (IID) of each region. We then introduce the notion of a time-frequency binary mask, which selects a region when the target is stronger than the interference within it. Finally, we regroup the selected time-frequency regions to resynthesize the target speech. The results obtained indicate that the approach described here is effective.
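To make the pipeline concrete, the following Python sketch implements one plausible reading of the three steps: a time-frequency decomposition, per-unit ITD/IID estimation, and a binary mask that keeps units whose cues match the target better than the interference. The STFT front end, the function name `binary_mask_segregation`, the assumption that the two sources' ITD/IID values are known in advance, and the cue-distance weighting are all assumptions of this sketch, not details taken from the paper (which may well use an auditory filterbank such as a gammatone front end and estimate the cues per region).

```python
import numpy as np
from scipy.signal import stft, istft

def binary_mask_segregation(left, right, fs, target_itd, interf_itd,
                            target_iid=0.0, interf_iid=0.0, nperseg=512):
    """Segregate the target from a two-source stereo mixture using a
    binary time-frequency mask driven by ITD/IID localization cues.

    target_itd / interf_itd: known (or pre-estimated) interaural time
    differences of the two sources, in seconds -- an assumption of this
    sketch. target_iid / interf_iid are the corresponding IIDs in dB.
    """
    f, t, L = stft(left, fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)

    # Interaural phase difference -> ITD estimate per T-F unit
    # (guard against division by zero at DC).
    ipd = np.angle(L * np.conj(R))
    freqs = np.where(f > 0, f, np.inf)[:, None]
    itd = ipd / (2 * np.pi * freqs)

    # Interaural intensity difference (dB) per T-F unit.
    eps = 1e-12
    iid = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))

    # Assign each unit to whichever source's cues it matches better;
    # the binary mask keeps units where the target dominates.
    # The 0.01 weight on IID is an arbitrary choice for this sketch.
    d_target = (itd - target_itd) ** 2 + 0.01 * (iid - target_iid) ** 2
    d_interf = (itd - interf_itd) ** 2 + 0.01 * (iid - interf_iid) ** 2
    mask = (d_target < d_interf).astype(float)

    # Resynthesize by regrouping the selected time-frequency regions.
    _, target = istft(mask * L, fs=fs, nperseg=nperseg)
    return target
```

One design note on the cue combination: the phase-derived ITD becomes ambiguous at higher frequencies, where a phase difference can correspond to several candidate delays, which is one reason systems of this kind also weigh in the IID cue.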