
Research On Feature Extraction And Fusion Of Audio Visual Information

Posted on: 2022-08-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Jiang
Full Text: PDF
GTID: 2518306524481254
Subject: Systems Engineering
Abstract/Summary:
With the rapid development of artificial intelligence technology, methods that use image or sound information to represent a target are increasing. Because of the diversity and complexity of information in the target's physical environment, it is difficult to fully represent a perceived target using visual or auditory information alone. This thesis therefore studies feature extraction and fusion methods for visual and auditory information, in order to realize comprehensive processing, fusion, and perception of a target's audio-visual information in low-SNR environments. The main research work is as follows.

First, this thesis establishes a data set containing 900 seconds of audio and 1,150 images. To reflect the noise interference and abnormal gain found in real environments, the initial data set is augmented by varying the signal-to-noise ratio, gain, and other parameters, yielding an augmented data set of 9,955 seconds of audio and 12,595 images.

Second, this thesis analyzes how the ordering of operations in the residual structure affects network performance, and proposes an auditory feature extraction model based on an improved residual structure and a visual feature extraction model based on a multilayer convolutional neural network. Classification experiments with the proposed models on the public data sets ESC-50 and CIFAR-10 and on the test set established in this thesis, compared against the pre-trained models VGGish and VGG19, demonstrate the effectiveness of the feature extraction models.

Third, building on model fusion theory, feature concatenation, and the correspondence autoencoder, an improved audio-visual information fusion model based on the correspondence autoencoder is proposed. On top of the autoencoder, the model adds an association loss between the hidden-layer representations of the audio and visual streams, so as to obtain a shared hidden-layer representation of the audio-visual information; a regularization term is added to the loss function to avoid over-fitting of the hidden-layer representation while keeping the hidden-layer information usable.

Finally, the F1-score metric and the t-SNE method are used to evaluate and analyze the experimental results of the above feature extraction and fusion methods. When only auditory information is used, the highest target recognition accuracy is 47.5% with an F1 score of 0.407; when only visual information is used, the highest accuracy is 60.8% with an F1 score of 0.611; with the correspondence-autoencoder-based audio-visual fusion method, the accuracy reaches 84.2% with an F1 score of 0.846, which is at least 23.4 percentage points higher in accuracy and at least 0.235 higher in F1 score than either single-modality representation, effectively improving target perception performance in low-SNR environments.
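The augmentation step described above, mixing noise at a controlled signal-to-noise ratio and applying a gain, can be sketched as follows. This is a minimal NumPy illustration, not the thesis's actual pipeline; the function names and the use of white noise are assumptions.

```python
import numpy as np

def add_noise_at_snr(clean, snr_db, rng=None):
    """Mix white noise into a clean signal at a target SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(clean))
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_signal / p_scaled_noise) == snr_db
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def apply_gain(signal, gain_db):
    """Apply a fixed gain (in dB) to simulate abnormal amplification."""
    return signal * 10.0 ** (gain_db / 20.0)
```

Sweeping `snr_db` and `gain_db` over a grid of values would expand each clean recording into many noisy variants, which matches the scale of growth reported for the data set.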
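The fusion objective described above, reconstruction plus a hidden-layer association loss plus a regularization term, can be sketched as a single loss function. This is a simplified NumPy sketch under the assumption of mean-squared-error reconstruction and L2 regularization; the names `alpha` and `lam` and the exact form of each term are illustrative, not taken from the thesis.

```python
import numpy as np

def corr_ae_loss(x_a, x_v, z_a, z_v, xhat_a, xhat_v, weights,
                 alpha=1.0, lam=1e-4):
    """Combined loss sketch for a correspondence autoencoder.

    x_a, x_v       : audio / visual input features
    z_a, z_v       : their hidden-layer representations
    xhat_a, xhat_v : reconstructions of the inputs
    weights        : weight matrices, penalized by the L2 regularizer
    """
    # Reconstruction losses for the two autoencoder branches
    rec = np.mean((x_a - xhat_a) ** 2) + np.mean((x_v - xhat_v) ** 2)
    # Association loss: pull paired audio/visual hidden codes together
    assoc = np.mean((z_a - z_v) ** 2)
    # Regularization term to curb over-fitting of the hidden representation
    reg = sum(np.sum(w ** 2) for w in weights)
    return rec + alpha * assoc + lam * reg
```

Minimizing the association term drives the audio and visual hidden codes toward a shared representation, which is what the fused recognition stage then consumes.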
Keywords/Search Tags:Object Perception, Feature Extraction, Audio-Visual Fusion, Neural Network, Correspondence Autoencoder