
Visual Data Understanding Based On Deep Encoder-Decoder Framework

Posted on: 2019-07-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S H Li
Full Text: PDF
GTID: 1368330623450366
Subject: Control Science and Engineering
Abstract/Summary:
Visual data understanding mines the information in images and videos and forms structured descriptive text, to a certain extent spanning the semantic gap between visual data and human understanding. With the advance of the big-data era, the human ability to analyze visual data lags far behind the ability to acquire it, so it is urgent to translate semi-structured or unstructured data such as images and videos into structured data that a computer can process directly. This demand has prompted the transformation of traditional machine learning methods and the birth of deep learning. According to the characteristics of visual data, we unify visual data understanding as a sequence recognition problem and use a deep encoder-decoder framework to solve it. Visual data in natural scenes divides into images and videos. Because there is temporal and spatial correlation between the frames of a video, we consider videos and images separately and study both image understanding and video understanding methods. These two types of methods can parse middle- and high-level semantics in images and videos and output descriptive text. However, the descriptive text covers only the high-level semantics; it does not analyze the text appearing in the visual data. Text, as an important information carrier in visual data, also contains rich and precise high-level semantics, so we additionally study text recognition methods for natural scenes. In detail, the main contributions of this thesis are:

(1) We propose an image captioning method based on a multi-directional two-dimensional long short-term memory (LSTM) network. In the traditional encoder-decoder model, the fully connected layers of the CNN discard local information of the image. In the encoding stage, we use a two-dimensional LSTM to encode the deep feature map of the image and extract the correlations between objects in the image as local features. Just as a multi-directional one-dimensional LSTM can encode one-dimensional data from two directions, we use four two-dimensional LSTMs to encode the image feature map from four directions, so that the local features of our model are more diverse. In the decoding stage, we use both local and global features. Experimental results show that the proposed model outperforms state-of-the-art techniques on multiple evaluation metrics.

(2) We propose an attribute description method based on the attention mechanism. In traditional image understanding methods, due to limitations of the model and the training dataset, the resulting description contains only the object categories and the relationships between objects, ignoring the objects' attributes. However, attribute information is often an important element of image semantics. The traditional attention mechanism focuses on object categories, not attributes; we improve it by treating the channels of the feature map as an additional dimension for attention. We collect and preprocess data from the commercial website ETSY and construct the dataset ETSY-C, which describes clothing attributes. Experimental results show that our model outperforms current image captioning methods.

(3) We propose a video captioning method based on the attention mechanism. Current video captioning methods use a CNN to encode the video as a feature sequence and then integrate the sequence into a single feature vector. Although a CNN can extract deep features of the video frames, collapsing the feature sequence into a one-dimensional vector inevitably loses the relationship information between frames. Inspired by the attention mechanism, we integrate attention into the encoder-decoder framework, so that the model can focus on the relevant features in the feature sequence at each decoding step. This allows the model to automatically discover relationships between video frames under weak supervision. Comparing the model with related methods on standard datasets, we find that it is comparable to or outperforms state-of-the-art techniques on multiple evaluation metrics.

(4) We propose an end-to-end training method for natural scene text recognition. Traditional text recognition methods are bottom-up: they require image segmentation, character detection and recognition, and character integration. The cascade of independently trained detectors, classifiers, and integrators degrades the final recognition. We instead integrate the attention mechanism to train an end-to-end network that recognizes scene text without image segmentation. Text is composed of characters, and a single character is correlated not only with the character before it but also with the character after it. Considering the limited ability of the attention mechanism to extract global feature information, we introduce a review network to integrate that global information. Experimental results show that our model outperforms state-of-the-art techniques.
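The attention step shared by contributions (2)-(4) above can be illustrated in a few lines. The following is a minimal NumPy sketch (not the thesis implementation) of additive attention over an encoded feature sequence: the decoder state scores each encoder feature, the scores are normalized with a softmax, and the weighted sum forms the context vector fed to the decoder. The weight names (`W_f`, `W_h`, `v`) and dimensions are assumptions for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(features, hidden, W_f, W_h, v):
    """Additive attention over an encoded sequence.
    features: (T, d_f) encoded frame/region features
    hidden:   (d_h,)   current decoder hidden state
    Returns the context vector (d_f,) and the attention weights (T,)."""
    # Score each feature against the decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (T,)
    weights = softmax(scores)                             # non-negative, sums to 1
    context = weights @ features                          # weighted sum of features
    return context, weights

# Toy usage: 5 encoded features of dim 8, decoder state of dim 6.
rng = np.random.default_rng(0)
T, d_f, d_h, d_a = 5, 8, 6, 10
features = rng.standard_normal((T, d_f))
hidden = rng.standard_normal(d_h)
W_f = rng.standard_normal((d_f, d_a))
W_h = rng.standard_normal((d_h, d_a))
v = rng.standard_normal(d_a)
context, weights = additive_attention(features, hidden, W_f, W_h, v)
```

At each decoding step the hidden state changes, so the weights (and hence the context vector) shift toward the features most relevant to the next output symbol; this is what lets the video and text models attend to different frames or characters over time.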
Keywords/Search Tags:visual data, deep learning, deep encoder-decoder, convolutional neural network, long short-term memory network, attention mechanism