Image fusion aims to merge the key information of multiple images, obtained from a single sensor or from different sensors, into a high-quality image that is more informative for computer processing and visual perception. Conventional image fusion algorithms apply basic image processing techniques in various domains to generate the fused image. These methods rely on manually designed image transforms, activity-level measurements, and increasingly complex fusion strategies, which involve heavy computation to reach state-of-the-art performance. It is therefore crucial to create an efficient fusion paradigm that can be implemented easily and without high computational cost. With the evolution of deep learning (DL), convolutional neural networks (CNNs) have demonstrated considerable advances over conventional approaches in image processing and visual recognition tasks. CNNs can be employed to solve image fusion problems, since a single network can learn the image transformation, the activity-level measurement, and the fusion strategy. Current DL-based multi-focus image fusion (MFIF) techniques, in particular, treat MFIF as a classification task and employ a CNN as a pixel classifier. However, in the absence of labeled data, current supervised DL models for MFIF apply Gaussian blur to focused images to generate training data. Existing unsupervised DL models for multimodal image fusion are likewise simple and rely on shallow architectures. As a result, designing an unsupervised DL model with a customized architecture remains a focus of ongoing research to improve the performance of DL-based image fusion.

This thesis provides insight into unsupervised learning frameworks for multiple image fusion tasks. Its main interest is to exploit the feature learning ability of CNNs and to improve efficiency by customizing the standard CNN architecture with distinct variations. Multiple encoder-decoder frameworks are introduced to achieve image fusion in a new domain. Furthermore, a new objective function is developed to measure the information loss during network training. The proposed models have particular advantages over conventional techniques, such as better efficiency, better generalization, a lower component count, and, above all, an end-to-end solution for image fusion applications. The methodologies and their components are investigated in detail, and the efficiency of the developed techniques is evaluated through extensive experiments on benchmark datasets, with detailed analyses guiding the framework design.

The initial work explores an unsupervised DL framework that effectively fuses multi-focus source images without losing key details. An encoder-decoder framework is considered in which multi-scale convolutions are employed side by side with standard convolutions in the encoders to capture multi-contextual features of the input images. Multi-scale convolutions are likewise employed in the decoder for accurate reconstruction of the fused image. A new customized loss function based on structural similarity and pixel information is developed to monitor the training loss. The developed model performs considerably better across multiple fusion applications.
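The abstract does not spell out the layer configuration or the exact loss formulation; the following PyTorch sketch illustrates, under stated assumptions, what a side-by-side multi-scale encoder block and a structural-similarity-plus-pixel training loss could look like. The branch kernel sizes, the L1 pixel term, the third-party pytorch_msssim SSIM implementation, and the weights alpha and lam are illustrative choices, not the thesis's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party SSIM implementation (assumption)

class MultiScaleConvBlock(nn.Module):
    """Encoder block running multi-scale convolutions side by side with a
    standard 3x3 convolution; the branch outputs are concatenated."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        branch_ch = out_ch // 4
        self.conv3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.conv7 = nn.Conv2d(in_ch, branch_ch, 7, padding=3)
        self.conv_std = nn.Conv2d(in_ch, out_ch - 3 * branch_ch, 3, padding=1)

    def forward(self, x):
        feats = [self.conv3(x), self.conv5(x), self.conv7(x), self.conv_std(x)]
        return F.relu(torch.cat(feats, dim=1))

def fusion_loss(fused, src_a, src_b, alpha=0.5, lam=0.1):
    """Unsupervised loss: a structural-similarity term plus a pixel (L1) term,
    each measured against both source images. alpha and lam are illustrative."""
    ssim_term = alpha * (1 - ssim(fused, src_a, data_range=1.0)) + \
                (1 - alpha) * (1 - ssim(fused, src_b, data_range=1.0))
    pixel_term = alpha * F.l1_loss(fused, src_a) + \
                 (1 - alpha) * F.l1_loss(fused, src_b)
    return ssim_term + lam * pixel_term
```

Measuring both terms against each source image keeps the training unsupervised, since no fused ground-truth image is required.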
Additionally, an improved unsupervised end-to-end wide-and-deep framework is developed to achieve better fusion results. In the encoders of this network, dense connections between layers are introduced, enabling the layers to extract image features efficiently. Distinct types of local dense features are extracted at multiple levels; the features of each level are combined through dense feature fusion, features of the different distributions are merged through global feature fusion, and the result is reconstructed by the decoders to produce the fused image.

The work is further extended to a novel end-to-end DL model for multimodal image fusion. To extract multi-contextual details of multimodal images, dilated convolutions are employed in the encoders instead of standard convolutions, which reduces both the computation and the number of network parameters. A new fusion strategy is developed for more accurate fusion of the extracted features: self-attention is adopted to refine and adaptively fuse the multi-contextual features of multiple modalities. Furthermore, a new training dataset containing images obtained from multiple sensors is used for network training. Finally, the model is tested on infrared and visible images as well as multiple medical modalities to confirm its generalization. To validate the performance, extensive qualitative and quantitative experiments are conducted and discussed in detail on benchmark datasets of multi-focus, multi-exposure, infrared, visible, and medical images.
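As with the earlier framework, the precise architecture is not given in the abstract; the sketch below illustrates the two ingredients described for the multimodal model: dilated convolutions in the encoder and a self-attention module that fuses features from two modalities. The dilation rates, channel sizes, and the non-local attention layout are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedEncoderBlock(nn.Module):
    """Encoder block using dilated 3x3 convolutions to enlarge the receptive
    field without the parameter cost of larger kernels.
    out_ch is assumed to be divisible by the number of dilation rates."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch // len(dilations), 3, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x):
        return F.relu(torch.cat([b(x) for b in self.branches], dim=1))

class SelfAttentionFusion(nn.Module):
    """Non-local-style self-attention that refines and adaptively fuses the
    multi-contextual features of two modalities (layout is illustrative)."""
    def __init__(self, ch):
        super().__init__()
        self.query = nn.Conv2d(2 * ch, ch // 2, 1)
        self.key = nn.Conv2d(2 * ch, ch // 2, 1)
        self.value = nn.Conv2d(2 * ch, ch, 1)
        self.out = nn.Conv2d(ch, ch, 1)

    def forward(self, feat_a, feat_b):
        x = torch.cat([feat_a, feat_b], dim=1)        # stack both modalities
        b, _, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # B x HW x C'
        k = self.key(x).flatten(2)                    # B x C' x HW
        v = self.value(x).flatten(2).transpose(1, 2)  # B x HW x C
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)
        fused = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return self.out(fused)

# Hypothetical usage: fuse encoder features of an infrared/visible pair.
# enc = DilatedEncoderBlock(1, 48); fuser = SelfAttentionFusion(48)
# fused_feats = fuser(enc(ir_image), enc(vis_image))
```

The 1x1 query/key/value projections and the softmax over spatial positions are one common way to realize self-attention on feature maps; the thesis's actual refinement and fusion rules may differ.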