With the rapid development of computer technology and social networks, massive amounts of data are generated all the time in daily life. How to use such data to realize intelligent tasks has become a research hotspot. In practical applications, massive data usually exist in different modalities, such as text, images, video, audio, and 3D models. Although these data take different forms, data of different modalities may be highly correlated and may even describe the same thing. In current research based on multi-modal data, cross-modal intelligence and related research that exploits the correlation between data of different modalities have attracted much attention due to their wide application. As a subtask of cross-modal intelligence, cross-modal generation is widely used in practical scenarios such as computer-aided design, image editing, machine translation, and information digitization. Current research shows that cross-modal generation methods based on deep neural networks outperform those based on traditional machine learning algorithms and have become the main research direction in the field of cross-modal generation. Cross-modal generation not only generates data of one modality from another, but also requires the generated data to be so similar to real data that the two are difficult to distinguish. For the task of cross-modal generation, this thesis takes the cascaded adversarial network as the basic generation framework and mainly studies cross-modal generation methods among text, images, and 3D point clouds. The specific research contents are as follows:

(1) A text-to-image generation method based on background induction and a multi-level discriminator is proposed. The method combines a cascaded adversarial network with a hybrid attention mechanism to construct a multi-stage image generation framework, and adds a background image to this framework as auxiliary information. Under the joint constraint of the text description and the background image, the method can generate diverse images with different foreground objects on a given background. In addition, the method introduces a multi-level discriminator and a corresponding multi-level discrimination loss to further improve the quality of image generation. Experimental results on the CUB bird dataset demonstrate the superiority of the proposed method and its ability to generate images on a given background.

(2) A cascaded generation method is proposed for dense point cloud reconstruction from a single image. The method combines a pre-reconstruction network with an up-sampling network to construct a multi-stage point cloud generation network. Meanwhile, an image re-description mechanism is designed to optimize this network by generating images back from the reconstructed point clouds. In addition, the method introduces a Siamese structure to extract consistent high-level semantics from multiple images, further enhancing the semantic correlation between images and reconstructed point clouds. During optimization of the multi-stage point cloud generation network, the training difficulty is significantly reduced through stage-wise training followed by fine-tuning of the whole network. Extensive experiments on the ShapeNet dataset show that the proposed method significantly outperforms existing point cloud reconstruction methods.
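The coarse-to-fine structure of the cascaded framework in contribution (1) can be illustrated with a minimal NumPy sketch. The function names (`initial_stage`, `refine_stage`, `discriminator_score`) and the toy linear operations are hypothetical placeholders, not the networks proposed in this thesis; the sketch only shows the skeleton in which each stage upsamples and refines the previous stage's output under text and background conditioning, and each stage has its own discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)

def initial_stage(text_embedding, background):
    # Toy stand-in for the first generator stage: produce a coarse
    # 64x64 "image" conditioned on the text and the background image.
    coarse = 0.5 * background[::4, ::4] + 0.5 * np.tanh(text_embedding.mean())
    return coarse

def refine_stage(prev_image, text_embedding):
    # Toy refinement stage: 2x nearest-neighbour upsampling plus a small
    # text-conditioned correction (placeholder for a real generator).
    up = prev_image.repeat(2, axis=0).repeat(2, axis=1)
    return up + 0.1 * np.tanh(text_embedding.mean())

def discriminator_score(image):
    # Placeholder per-stage discriminator: one scalar "realism" score.
    return float(np.tanh(image).mean())

text_embedding = rng.normal(size=128)
background = rng.normal(size=(256, 256))

# Cascade: 64x64 -> 128x128 -> 256x256, one discriminator per stage,
# so a multi-level discrimination loss could sum the per-stage scores.
stages = [initial_stage(text_embedding, background)]
for _ in range(2):
    stages.append(refine_stage(stages[-1], text_embedding))

scores = [discriminator_score(img) for img in stages]
print([img.shape for img in stages])  # [(64, 64), (128, 128), (256, 256)]
```

In a real system each stage and discriminator would be a trained convolutional network; the point of the sketch is only the cascaded data flow.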
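Reconstruction quality on benchmarks such as ShapeNet is commonly measured with the Chamfer distance between the reconstructed and ground-truth point sets; whether this exact metric is the one used in the thesis is an assumption here. A minimal NumPy version:

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    For each point in one set, take the squared distance to its nearest
    neighbour in the other set, then average both directions.
    """
    # Pairwise squared distances via broadcasting, shape (N, M).
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Identical clouds have zero distance; a shifted copy does not.
rng = np.random.default_rng(1)
cloud = rng.normal(size=(1024, 3))
print(chamfer_distance(cloud, cloud))            # 0.0
print(chamfer_distance(cloud, cloud + 0.1) > 0)  # True
```

The O(N*M) pairwise matrix is fine at this scale; large clouds would use a KD-tree nearest-neighbour query instead.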