With the continuous development of 3D acquisition equipment, 3D models have become increasingly important in the field of computer vision. Existing feature fusion methods for classifying and retrieving 3D models based on point clouds and images ignore intra-modal feature information and the complementary information between modalities; the lack of high-dimensional correlation between classification labels and predicted features leads to loss in the fused features and poor retrieval accuracy. Likewise, most current approaches to retrieving 3D models from sketches focus on the design of conventional deep networks and cross-domain algorithms that map the features of different modalities into a single feature space; all of these approaches center on transforming feature vectors and disregard cross-modal visual associations during feature generation.

To address these issues, this research proposes two 3D model retrieval approaches driven by multimodal feature fusion and word embedding, as well as a 3D model retrieval method driven by multimodal feature fusion and knowledge distillation, all built on deep learning techniques. Through theoretical analysis and experimental comparison, the effectiveness of the proposed methods is demonstrated: they yield precise 3D model feature descriptors and, compared with many existing methods, considerably improve the classification and retrieval accuracy of 3D models. The key research findings and contributions of this paper are as follows:

(1) Existing feature fusion approaches ignore intra-modal feature information and inter-modal complementary information when classifying and retrieving 3D models based on point clouds and views, which leads to loss in the fused features. A multimodal feature fusion strategy is therefore proposed. During feature extraction, dedicated feature extractors obtain 3D model features from point clouds and views, the features of the different modalities are aligned in a shared space, cosine similarity is computed across modalities to enhance each modality's features, and the enhanced modal features are then combined into a fusion feature, thereby reducing the heterogeneity between modalities (a minimal sketch of this fusion step is given after this list).

(2) A word embedding strategy is proposed to guide the training of the multimodal fusion network, addressing the low retrieval accuracy and the lack of high-dimensional correlation between classification labels and predicted features in the 3D model field. The similarity between the features predicted by the multimodal fusion network and the word vectors of the true classification labels is computed, enabling a unified representation of 3D model features for classification and retrieval (see the second sketch below).

(3) A knowledge distillation approach is presented to constrain and guide the feature extraction of the sketch network, addressing the insufficient cross-modal visual correlation and the domain gap in the feature extraction process of existing sketch-based 3D model retrieval methods. The retrieval feature vector produced by the sketch feature extraction network and the target feature vector distilled from the multimodal fusion network are combined to build the knowledge distillation guidance function. This achieves semantic consistency across domains, reduces cross-modal disparities, and improves the generalization ability and retrieval precision of sketch-based 3D model retrieval (see the third sketch below).
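The following is a minimal PyTorch-style sketch of the cross-modal enhancement and fusion step described in contribution (1). The module name, feature dimensions, and the assumption that the point-cloud and view encoders output fixed-length vectors are illustrative choices, not the exact architecture used in this work.

```python
# Hypothetical sketch of cross-modal feature enhancement and fusion (contribution 1).
# Dimensions and module names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    def __init__(self, point_dim=1024, view_dim=512, shared_dim=256):
        super().__init__()
        # Project each modality into a shared space so their features are comparable.
        self.point_proj = nn.Linear(point_dim, shared_dim)
        self.view_proj = nn.Linear(view_dim, shared_dim)

    def forward(self, point_feat, view_feat):
        # point_feat: (B, point_dim) from a point-cloud encoder
        # view_feat:  (B, view_dim)  from a view/image encoder
        p = self.point_proj(point_feat)                         # (B, shared_dim)
        v = self.view_proj(view_feat)                           # (B, shared_dim)

        # Per-sample cosine similarity between the aligned modalities.
        sim = F.cosine_similarity(p, v, dim=-1).unsqueeze(-1)   # (B, 1)

        # Enhance each modality with the other, weighted by their similarity,
        # so complementary information is injected before fusion.
        p_enh = p + sim * v
        v_enh = v + sim * p

        # Concatenate the enhanced modal features into the fusion feature.
        return torch.cat([p_enh, v_enh], dim=-1)                # (B, 2 * shared_dim)


if __name__ == "__main__":
    fusion = CrossModalFusion()
    point_feat = torch.randn(4, 1024)   # placeholder point-cloud features
    view_feat = torch.randn(4, 512)     # placeholder view features
    print(fusion(point_feat, view_feat).shape)   # torch.Size([4, 512])
```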
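The second sketch illustrates the word-embedding guidance of contribution (2): the fused feature is compared against word vectors of the class labels, and the resulting similarity scores are trained with cross-entropy so that predicted features correlate with the semantics of the true labels. The label-embedding matrix (e.g., word vectors of the class names, assumed to be projected to the same dimension as the fused feature) and the temperature value are assumptions.

```python
# Hypothetical sketch of word-embedding-guided training (contribution 2).
import torch
import torch.nn.functional as F


def label_similarity_loss(fused_feat, label_embed, targets, temperature=0.07):
    """
    fused_feat:  (B, D) features predicted by the multimodal fusion network
    label_embed: (C, D) word-embedding vectors of the C class labels
    targets:     (B,)   ground-truth class indices
    """
    # Normalize both sides so the dot product equals cosine similarity.
    feat = F.normalize(fused_feat, dim=-1)
    labels = F.normalize(label_embed, dim=-1)

    # Cosine similarity between each fused feature and every label vector.
    logits = feat @ labels.t() / temperature      # (B, C)

    # Cross-entropy pulls each feature toward its true label's word vector.
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    B, C, D = 8, 40, 512
    fused_feat = torch.randn(B, D, requires_grad=True)
    label_embed = torch.randn(C, D)               # placeholder label word vectors
    targets = torch.randint(0, C, (B,))
    loss = label_similarity_loss(fused_feat, label_embed, targets)
    loss.backward()
    print(loss.item())
```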
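The third sketch outlines the distillation guidance of contribution (3): the pre-trained multimodal fusion network acts as a frozen teacher, and the sketch feature extraction network is the student. The guidance function below combines a feature-matching term (pulling sketch features toward the distilled target features) with a standard classification term; the weighting and the exact form of the loss are assumptions and may differ from the guidance function used in this work.

```python
# Hypothetical sketch of the knowledge distillation guidance function (contribution 3).
import torch
import torch.nn.functional as F


def distillation_guidance(sketch_feat, target_feat, logits, targets, alpha=0.5):
    """
    sketch_feat: (B, D) retrieval features from the sketch network (student)
    target_feat: (B, D) target features distilled from the fusion network (teacher)
    logits:      (B, C) class predictions of the sketch network
    targets:     (B,)   ground-truth class indices
    """
    # Feature-level distillation: align sketch features with the teacher's
    # target features so both domains share one semantic space.
    distill = 1.0 - F.cosine_similarity(sketch_feat, target_feat, dim=-1).mean()

    # Standard classification loss on the sketch branch.
    cls = F.cross_entropy(logits, targets)

    # Joint guidance function used to train the sketch network.
    return alpha * distill + (1.0 - alpha) * cls


if __name__ == "__main__":
    B, C, D = 8, 40, 512
    sketch_feat = torch.randn(B, D, requires_grad=True)
    target_feat = torch.randn(B, D)     # produced by the frozen teacher, no gradient
    logits = torch.randn(B, C, requires_grad=True)
    targets = torch.randint(0, C, (B,))
    loss = distillation_guidance(sketch_feat, target_feat, logits, targets)
    loss.backward()
    print(loss.item())
```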