
Image Captioning And Image Security Techniques Using Deep Neural Networks

Posted on: 2022-02-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Syeda Nuzhat Subi Naqvi
Full Text: PDF
GTID: 1488306323962889
Subject: Information and Communication Engineering
Abstract/Summary:
Image processing is a popular research field and a sub-category of digital signal processing, while image content translation and image content security belong to artificial intelligence (AI). Image understanding requires the detection and identification of objects, scenes, locations, and interactions or relations within an image, and generating well-formed sentences requires both syntactic and semantic understanding of language. Every day we encounter many images from various sources such as the internet, news articles, document diagrams, and advertisements. Unfortunately, these images often lack proper labels and are not well protected against digital signal processing attacks. When humans search for particular images on a website or in a database, the machine needs both image interpretation and image-content security. Image captioning and image security are important for many reasons; for example, they enable automatic image indexing. Image indexing underpins content-based image retrieval (CBIR) and can therefore be applied to many areas, including biomedicine, commerce, the military, education, digital libraries, and web searching. Social media platforms such as Facebook and Twitter can generate descriptions directly from images, which requires proper labeling and protection from intruders. To fill these gaps, we developed systems that automatically generate image descriptions and protect image content.

Our first work examines prevailing image captioning approaches that produce a text description of a source image, using either plain encoder-decoder models or encoder-decoder models combined with attention mechanisms. Both types of model face a variety of problems. Attention-based approaches attend to a particular area or object through a single heat map that indicates which part of the image is most important, rather than treating all objects in the image as equally important. Single-heat-map models built on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) suffer from their exclusive use of a global, image-level representation and, as a result, miss objects and misinterpret the image. These models also ignore non-visual signal embeddings, which limits the accuracy and diversity of the generated descriptions. To address these issues, we propose a global-local and joint signals attention model (GL-JSAM). The model first extracts global features at the image level and local features at the object level, then accumulates them into a detailed image representation. The new joint signals attention module selects only relevant signals, discards irrelevant and repetitive features, and passes the result to the language model. Within the language model, joint signal attention decides at each time step whether to attend to image features or language features, generating rich, accurate, descriptive, and diverse sentences. The joint attention model thus plays a dual role between the image and language models. The effectiveness and superiority of the proposed approach were examined against recent image captioning methods through experiments on the MS-COCO dataset.

Our second work targets an image descriptor built on an extracted dataset that helps young children understand images in an educational environment. Unfortunately, popular datasets frequently used for image captioning, such as Flickr8k, 11k, and MS-COCO, carry visual descriptions that are either too complex or too generic and are therefore unsuitable for children's learning. Suitable teaching material for young children at the initial educational stage is crucial now that they have easy access to smart devices. To fill this gap, we propose an automatic digital image descriptor. The model uses smart augmentation to build a merged 3K-Flickr-SDD dataset from Flickr8k and the Stanford Dogs Dataset (SDD), and we modified each label of the merged dataset to make it appropriate for children's understanding. Visual feature extraction is performed with a CNN (Convolutional Neural Network), while a customized LSTM (Long Short-Term Memory) language model generates the text sequences. A plain RNN (Recurrent Neural Network) is avoided because it forgets previously generated information due to vanishing gradients. Quantitative and qualitative analysis shows that the proposed model outperforms existing models on standard datasets and achieves competitive results on both versions of the merged 3K-Flickr-SDD dataset.

Our third work explores the security of image content shared over networks. Our initial attempt embeds input images into an audio cover, although the model could easily be applied to other cover media such as video, voice, and text. Sharing photos over digital networks is insecure: data owners routinely lose copyright protection and content authentication. Existing audio watermarking strategies are not robust enough against signal processing attacks, high embedding capacity tends to compromise imperceptibility, and a single stego layer cannot provide large hiding space and is also detectable by attackers. Balancing robustness, imperceptibility, and capacity remains a major challenge for state-of-the-art models. As a solution, we propose a robust three-fold, dual-encrypted image-in-audio watermarking scheme that first encrypts the binary watermark image twice, improving its security. Before embedding, both the encrypted image and the host audio signal are decomposed by dual-tree complex wavelet transform (DTCWT), short-time Fourier transform (STFT), and singular value decomposition (SVD), with the SVD layer used for embedding. The three-fold transformations improve capacity and imperceptibility and provide complementary levels of robustness. Experimental results show improved performance under various digital signal processing attacks.

In our fourth work, another watermarking scheme is presented, focusing on the security and privacy of digital data over insecure networks. Previous systems concentrated on robustness, imperceptibility, and capacity but gave little priority to data security. As a solution, we propose a robust two-fold image-in-audio watermarking scheme that first processes the binary image with Arnold encryption (AE) and Bose-Chaudhuri-Hocquenghem (BCH) codes. The improved watermark security ensures that intruders cannot extract the watermark directly. We also improve hiding capacity and imperceptibility by embedding the input images into the cover audio through dual-tree complex wavelet transform (DTCWT), discrete cosine transform (DCT), and singular value decomposition (SVD). Compared with existing audio watermarking strategies, the proposed scheme delivers better security, stronger robustness, larger embedding capacity, and higher imperceptibility against Gaussian noise, re-sampling, band-pass filtering, echo, MP3/MP4 compression, and cropping attacks.
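The joint attention idea described for GL-JSAM can be illustrated with a minimal numpy sketch. This is not the dissertation's implementation; all weights, dimensions, and names here are hypothetical placeholders. It shows one attention step: global and local features are scored against the language state, fused into a context vector, and a sigmoid gate decides how much to rely on the visual context versus the language state.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def joint_signal_attention(global_feat, local_feats, lang_state, W, w_gate):
    """One attention step: score the global (image-level) feature together with
    the local (object-level) features against the language state, fuse them,
    then gate between the visual context and the language state itself."""
    feats = np.vstack([global_feat, local_feats])      # (1 + K, D)
    scores = feats @ W @ lang_state                    # relevance of each feature
    alpha = softmax(scores)                            # attention weights, sum to 1
    context = alpha @ feats                            # attended visual context (D,)
    beta = 1.0 / (1.0 + np.exp(-w_gate @ lang_state))  # visual-vs-language gate
    return beta * context + (1.0 - beta) * lang_state, alpha

# Toy inputs (hypothetical sizes: feature dim D, K detected objects).
D, K = 8, 3
g = rng.standard_normal(D)           # global CNN feature
loc = rng.standard_normal((K, D))    # K object-level features
h = rng.standard_normal(D)           # current language-model hidden state
W = rng.standard_normal((D, D))
w_gate = rng.standard_normal(D)
ctx, alpha = joint_signal_attention(g, loc, h, W, w_gate)
```

In a trained model `W` and `w_gate` would be learned jointly with the decoder; here they are random, so only the mechanics (normalized weights over global plus local features, gated fusion) are meaningful.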
Keywords/Search Tags:Image Caption, Global-Local Features, Soft-Hard, Visual Attention, Canonical, Non-canonical Signals, LSTM, CNN, RNN, Audio Watermark(AW), Copyright Protection(CP), Dual-Tree Complex Wavelet Transform(DTCWT), Short-Time Fourier Transform(STFT)