| As technology continues to advance, portable electronic devices used in daily life, such as mobile phones and tablet computers, can easily capture high-resolution images, making image data very convenient to obtain. However, two-dimensional images cannot directly represent the real three-dimensional world, so research on image-based three-dimensional understanding has received increasing attention in recent years. An input image usually reflects the structure of the scene, as well as information such as furniture and everyday objects placed at different positions and orientations in the space. Many difficulties remain in this field, such as objects occluding each other in a single view and the wide variety of object types. Based on a single view of an indoor scene, this paper uses deep learning methods to predict the layout of the scene, the spatial location of the objects in the image, the shapes of the objects, and the relationships between objects. For shape representation, this paper presents three different indoor object representation methods (holistic 3D reconstruction, semantic structure 3D reconstruction, and shape matching 3D reconstruction) and an indoor scene understanding scheme that incorporates semantic information. The main work of this paper is as follows: 1. For single-view holistic 3D reconstruction, a method based on the self-attention mechanism is proposed. It mainly addresses the accuracy of vertex positions when generating a 3D shape from an image: by adaptively adjusting the influence weights between vertices during training, the network predicts more accurate vertex positions. 2. For single-view semantic structure 3D reconstruction, an object is regarded as a whole assembled from cuboids of different sizes according to structure tree information. A partial structure feature extraction module is designed, and the three-dimensional shape of each part is reconstructed from its features. Finally, the part shapes are integrated according to the structure tree information to obtain a complete object as the reconstruction result. 3. For single-view shape matching 3D reconstruction, a 3D shape database containing various types of objects is first constructed, and the closest 3D shape in the database is then predicted from the image features as the shape matching result. During training, cross-entropy loss together with Focal Loss is used to address the uneven sample distribution in the real-scene dataset Pix3D, and the method surpasses other methods with a category-average Chamfer distance of 0.272 mm. 4. The shape matching 3D reconstruction method is trained jointly with existing layout estimation and object detection networks, and then combined with object relationship detection to obtain the final object relationship description, which together with the object representations forms a complete indoor scene understanding result. |
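
The third contribution relies on two quantities that the abstract names but does not define: Focal Loss, used for the class imbalance in Pix3D, and the Chamfer distance, used for evaluation. The sketch below is a minimal pure-Python illustration of both, not the paper's implementation. The function names, the `gamma=2.0` default, and the particular Chamfer variant (sum of the two directional mean nearest-neighbour Euclidean distances) are assumptions; the abstract does not state which settings or definition the authors use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    gamma is an assumed hyperparameter; gamma = 0 gives plain cross-entropy.
    Larger gamma down-weights samples the model already classifies well,
    so rare (hard) classes contribute relatively more to the total loss.
    """
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Assumed variant: mean nearest-neighbour Euclidean distance from A to B
    plus the same from B to A. Other definitions use squared distances.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


if __name__ == "__main__":
    # A confidently correct prediction is penalised far less than under
    # plain cross-entropy (gamma = 0).
    print(focal_loss([5.0, 0.0, 0.0], target=0))
    print(focal_loss([5.0, 0.0, 0.0], target=0, gamma=0.0))
    # Identical point sets have zero Chamfer distance.
    cube_corner = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(chamfer_distance(cube_corner, cube_corner))
```

In a shape matching setting, `logits` would be the network's scores over database shapes and `target` the index of the ground-truth shape; the Chamfer distance would then compare points sampled from the retrieved shape against points sampled from the ground truth.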