
Style-based Cross Domain Image Captioning Technology

Posted on: 2022-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: S Zhu
Full Text: PDF
GTID: 2518306731978019
Subject: Computer technology
Abstract/Summary:
With the recent development of deep learning, and especially the success of the encoder-decoder framework, the performance of image captioning has improved greatly. However, the high performance of existing models depends mainly on the type and style of the labeled data, and drops sharply once a model is transferred to a different domain. Although several works, including adversarial-learning and dual-learning approaches, have tried to solve this problem by bridging the large gap between the source and target domains, performance remains unsatisfactory for the following reasons. First, current methods focus mainly on minimizing the gap between the two domains and are incapable of deeply understanding and analyzing language style. Second, the models focus on accurately understanding image content and ignore learning the grammatical structure of language. Third, existing models usually output a single-style description for each image and cannot adaptively generate multiple expressions of the image content.

To tackle these problems, this thesis aims to improve cross-domain image captioning through research along three lines: style information, model structure, and training scheme. The main contributions are as follows:

1. For language-style expression, this thesis discards the traditional binary division (source style / target style) and instead uses the constituency-tree structure to distinguish language styles. A structured clustering algorithm is employed to generate multiple styles, which is more reasonable.

2. For model design, this thesis proposes an I-LSTM structure based on an "instruct gate". The decoder receives part-of-speech instructional information through the instruct gate, and the model jointly learns two tasks: image captioning and the grammatical structure of language.

3. For model training, in addition to the traditional image-captioning loss, this thesis introduces a style-matching loss to measure the style consistency of the generated sentences. The model can therefore adaptively generate a variety of descriptions according to different instructions, and thus describe image semantics flexibly from multiple directions.

To verify the effectiveness of the method, we conduct extensive experiments with MSCOCO as the source dataset and Flickr30K, Oxford-102, and CUB-200 as the target datasets. The experimental results demonstrate that the style-guided cross-domain image captioning model achieves significant improvements as measured by METEOR and CIDEr, which shows that style information is helpful for the cross-domain image captioning task. Moreover, our model alleviates, to a certain extent, the diversity problem of generated descriptions, which is valuable in practice.
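The abstract does not give the internal equations of the I-LSTM. A minimal sketch of one plausible reading, assuming the "instruct gate" is an extra sigmoid gate that reads a part-of-speech instruction vector and modulates the cell write (all weight names, the gating scheme, and the dimensions below are illustrative assumptions, not the thesis's actual formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def i_lstm_step(x, pos, h, c, W, b):
    """One step of a hypothetical I-LSTM cell.

    Besides the standard input (i), forget (f), output (o) gates and
    candidate (g), an extra "instruct gate" (n) reads a part-of-speech
    instruction vector `pos` and modulates the cell update.
    This gating scheme is an assumption for illustration only.
    """
    z = np.concatenate([x, h])                  # standard LSTM input
    i = sigmoid(W["i"] @ z + b["i"])            # input gate
    f = sigmoid(W["f"] @ z + b["f"])            # forget gate
    o = sigmoid(W["o"] @ z + b["o"])            # output gate
    g = np.tanh(W["g"] @ z + b["g"])            # candidate state
    n = sigmoid(W["n"] @ np.concatenate([pos, h]) + b["n"])  # instruct gate
    c_new = f * c + i * (n * g)                 # POS instruction modulates the write
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Tiny usage example with random weights (dimensions are arbitrary).
rng = np.random.default_rng(0)
dx, dp, dh = 4, 3, 5
W = {k: rng.normal(size=(dh, dx + dh)) for k in "ifog"}
W["n"] = rng.normal(size=(dh, dp + dh))
b = {k: np.zeros(dh) for k in "ifogn"}
h, c = np.zeros(dh), np.zeros(dh)
h, c = i_lstm_step(rng.normal(size=dx), rng.normal(size=dp), h, c, W, b)
```

In this sketch the instruction only gates how much of the candidate state is written, so the same decoder weights can produce different wordings under different part-of-speech instructions.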
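The style-matching loss is likewise not specified in the abstract. One plausible form is a weighted sum of the usual per-token caption cross-entropy and a penalty on the distance between the generated sentence's style embedding and the centroid of the instructed style cluster. The helper names, the cosine-distance choice, and the weight `lam` below are assumptions for illustration:

```python
import numpy as np

def caption_loss(logits, target_ids):
    """Standard per-token cross-entropy over the vocabulary."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(target_ids)), target_ids])

def style_matching_loss(sent_style, style_centroid):
    """Cosine distance between the generated sentence's style embedding
    and the centroid of the instructed style cluster (an assumed choice)."""
    cos = sent_style @ style_centroid / (
        np.linalg.norm(sent_style) * np.linalg.norm(style_centroid))
    return 1.0 - cos

def total_loss(logits, target_ids, sent_style, style_centroid, lam=0.5):
    # lam trades off style consistency against caption accuracy (illustrative)
    return caption_loss(logits, target_ids) + lam * style_matching_loss(
        sent_style, style_centroid)

# Usage with random tensors: 6 tokens, vocabulary of 10, style dim 8.
rng = np.random.default_rng(1)
L = total_loss(rng.normal(size=(6, 10)), rng.integers(0, 10, size=6),
               rng.normal(size=8), rng.normal(size=8))
```

Because the cross-entropy term is non-negative and the cosine-distance term lies in [0, 2], the combined loss is bounded below by zero under this reading.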
Keywords/Search Tags: Image captioning, Encoder-Decoder, cross domain, language style