Research And Implementation Of Automatic Segmentation And Clustering Methods For Multi-genre Audio

Posted on:2022-04-09

Degree:Master

Type:Thesis

Country:China

Candidate:Y Su

GTID:2518306320484674

Subject:Master of Engineering

Abstract/Summary:

Audio data downloaded, monitored, and collected in real-world scenarios is multi-genre, heterogeneous data characterized by variability, complexity, and multiple levels of structure. It may come from different acoustic conditions, such as broadband, narrowband, near-field, and far-field, and contains different kinds of audio, such as music, noise, and speech over music. Audio data streams are usually long and contain an unknown number of speakers. To obtain the audio type distribution and speaker distribution of such a stream automatically, a high-performance automatic segmentation and clustering method for multi-source speech is needed. The input audio stream is first divided into segments of different acoustic categories, which is the audio type segmentation problem for multi-source speech. The effective speech parts are then further segmented by speaker, and the resulting speech segments are clustered and merged, which is speaker segmentation and clustering.

For the audio type segmentation problem, a multi-genre audio data set for model training and testing was constructed through data preprocessing and normalized annotation, and open-source speech data sets were selected to expand the training set. Two convolutional neural network classifiers were trained: a speech/music/noise classifier and a speech/speech-over-music classifier. After silent segments are removed by energy-threshold silence detection, audio segmentation is completed by cascading the speech/music/noise classifier with the speech/speech-over-music classifier. To address the poor robustness of the speech/speech-over-music classifier on the multi-genre out-of-domain test set, K-means feature extraction and model training for this classifier were completed. Based on the functional characteristics of features at different scales, K-means features and spectrogram features are fused so that they complement each other. On the multi-source speech test set, the recall of the model trained on spectral features is 4.36 higher than that of the model trained on a single representation and 2.53 higher than that of the model trained on K-means features.

For the speaker diarization problem, three deep-learning embedding extractors were implemented, based respectively on a fully connected neural network, a gated logic unit, and a deep residual neural network. Because traditional clustering methods cannot determine the number of clusters, UIS-RNN was selected as the clustering model and compared experimentally with traditional methods such as K-means clustering and spectral clustering. The comparative experiments show that the UIS-RNN-based back-end clustering model improves the DER (Diarization Error Rate) by 6.64 compared to K-means and by 1.6 compared to spectral clustering.

To achieve automatic segmentation and clustering of multi-genre audio, the audio segmentation and speaker diarization modules are combined into a complete system in which multi-genre audio segmentation serves as the preprocessing stage for the speaker diarization module. After the speech segments are located, the speaker diarization module segments and clusters the speakers. Experimental results show that, on broadcast news data sets, the multi-source automatic segmentation and clustering algorithm improves the DER by 17.11 compared to using the speaker diarization algorithm alone.
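
The silence removal and classifier cascade described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the frame length, the dB threshold, the rule that only frames labelled "speech" by the first classifier are passed to the second, and all function names (`frame_energy_db`, `split_frames`, `label_frames`, `merge_segments`) are assumptions made for illustration, and the two classifiers are passed in as plain callables standing in for the trained CNNs.

```python
import math
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]


def frame_energy_db(frame: Frame) -> float:
    """Mean energy of one frame in dB, floored to avoid log(0)."""
    energy = sum(s * s for s in frame) / max(len(frame), 1)
    return 10.0 * math.log10(max(energy, 1e-12))


def split_frames(samples: Sequence[float], frame_len: int) -> List[Frame]:
    """Cut the signal into non-overlapping frames, dropping a short tail."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def label_frames(
    frames: List[Frame],
    threshold_db: float,
    coarse: Callable[[Frame], str],  # hypothetical: "speech" | "music" | "noise"
    fine: Callable[[Frame], str],    # hypothetical: "speech" | "speech_over_music"
) -> List[str]:
    """Energy-threshold silence detection followed by a two-stage cascade."""
    labels = []
    for frame in frames:
        if frame_energy_db(frame) < threshold_db:
            labels.append("silence")  # removed before classification
            continue
        label = coarse(frame)
        if label == "speech":
            # Assumed cascade rule: refine only the speech frames.
            label = fine(frame)
        labels.append(label)
    return labels


def merge_segments(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Merge runs of equal labels into (start_frame, end_frame, label)."""
    segments: List[Tuple[int, int, str]] = []
    for i, label in enumerate(labels):
        if segments and segments[-1][2] == label:
            start, _, _ = segments[-1]
            segments[-1] = (start, i + 1, label)
        else:
            segments.append((i, i + 1, label))
    return segments


if __name__ == "__main__":
    # Toy signal: one silent frame followed by two louder frames.
    samples = [0.0] * 4 + [0.5] * 8
    frames = split_frames(samples, frame_len=4)
    labels = label_frames(
        frames,
        threshold_db=-40.0,
        coarse=lambda f: "speech",           # stand-in for the first CNN
        fine=lambda f: "speech_over_music",  # stand-in for the second CNN
    )
    print(merge_segments(labels))
```

In a real system the segments labelled as speech would then be handed to the speaker diarization module, and the threshold would be tuned on held-out data rather than fixed.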

Keywords/Search Tags:

multi-genre audio, audio segmentation, speaker diarization, speaker embedding, UIS-RNN

Related items

1	Design And Implementation Of Speaker Diarization System
2	Robust Speaker Modeling in Non-Neutral Environments with Application to Large Scale Multi-Speaker Audio Stream
3	A Study On Speaker Diarization Based On Multiple Features
4	Research On Speaker Diarization Based On Deep Learning
5	Audio segmentation for meetings speech processing
6	Research And Implementation Of Key Technology In Speaker Diarization System
7	Research On Speaker Log System Based On Bayesian Method
8	Audio Processing In Content-based Video Retrieval
9	Speaker Diarization: Current Limitations and New Directions
10	The Modeling Research In Speaker Diarization