The intelligent diagnostic classification of medical images and retrieval-based text generation are extremely valuable in the "Internet + medical health" field. These technologies can reduce healthcare professionals' workload and improve the efficiency of medical image analysis and diagnosis. However, existing algorithms for medical imaging diagnosis and retrieval-based text generation can be improved in several respects: 1) image resolution should be improved to prevent poor classification models caused by minor variations between images; 2) the correlation between cross-modal data should be strengthened to narrow the heterogeneity gap between modalities; 3) the technique for fusing retrieved text into the generation model should be optimized to eliminate redundant information in the generation process. To address these issues, the main research contributions of the dissertation are as follows:

(1) Transformer-based factorized encoder for 3D CT image classification. X-ray images are widely used in existing diagnostic classification; however, they lack sufficient semantic information to distinguish between images of the same disease at different stages. Furthermore, because of the enormous number of convolution operations, existing CT-based methods are computationally expensive and lack long-range interaction between CT slices. To address these issues, a high-resolution dataset of pneumoconiosis CT images is constructed, and a transformer-based factorized encoder is proposed to model long-range interactions both within and between CT slices, alleviating the model degradation caused by minor differences between images. The accuracy of the proposed method is 2.94% higher than that of COVID-Net.

(2) A unified perspective on multi-level cross-modal similarity for cross-modal retrieval. Existing algorithms for evaluating cross-modal similarity ignore the local relationships between cross-modal data, limiting model performance. In addition, when computing label similarity, the classifier's classification bias can impair retrieval accuracy. A unified multi-level cross-modal similarity method is therefore proposed, which measures multi-level cross-modal similarity in a single common feature space. On the multi-modal datasets Pascal Sentence, Wikipedia, and XMediaNet, the average normalized discounted cumulative gain (NDCG) improves by 3.6%, 3.7%, and 6.5%, respectively, over DSRAN, a method based on dual semantic relations.

(3) A semi-supervised cross-modal memory bank for cross-modal retrieval. When measuring the correlation between unlabelled data, existing algorithms assume that unlabelled samples are correlated with their predefined k-nearest neighbours, which creates false connections between unrelated unlabelled samples and reduces the accuracy of cross-modal retrieval. To provide accurate supervision for unlabelled data, a semi-supervised cross-modal memory bank is proposed that revises pseudo-labels using both the feature representations of paired cross-modal data and the class probabilities of labelled data. With a supervision rate of 10%, the average MAP@50 of the proposed method on Wikipedia, NUS-WIDE, and MS-COCO increases by 2.6%, 1.8%, and 4.9%, respectively, over the semi-supervised method SCLss. The experimental results demonstrate that the proposed method outperforms existing methods.

(4) Retrieval-based adaptive fusion strategy for medical image report generation. When previous methods take X-ray images as the generative model's input, the variation between images is low, so the generated reports are highly similar to one another. Moreover, flaws in the fusion strategy introduce a considerable amount of redundant information into the generated text, lowering its quality. To solve these problems, we collect CT image-text data covering 8 lung diseases and propose a retrieval-based adaptive fusion strategy, which adds the weighted retrieval probability to the generation probability to realize a dynamic fusion process. Compared with the unweighted fusion method, the Consensus-based Image Description Evaluation (CIDEr) score of the proposed method improves by 15.9%. The experimental results show that text generated by the proposed method is closer to human-written reports.
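The fusion step in contribution (4) can be illustrated with a minimal sketch. The function name, the toy vocabulary, and the fixed scalar `weight` are hypothetical (the dissertation realizes the weighting dynamically); the sketch only shows the core idea of adding a weighted retrieval distribution to the generator's distribution and renormalizing.

```python
import numpy as np

def adaptive_fusion(p_gen, p_ret, weight):
    """Fuse generation and retrieval probabilities over a shared vocabulary.

    p_gen:  next-token distribution from the generation model.
    p_ret:  next-token distribution induced by the retrieved report.
    weight: scalar controlling the retrieval contribution (hypothetical
            fixed value here; a dynamic weight would be computed per step).
    """
    fused = p_gen + weight * p_ret
    return fused / fused.sum()  # renormalize to a valid distribution

# Toy 4-token vocabulary: the retrieval term re-ranks candidate tokens
# without overriding a confident generator.
p_gen = np.array([0.1, 0.6, 0.2, 0.1])
p_ret = np.array([0.5, 0.1, 0.3, 0.1])
p = adaptive_fusion(p_gen, p_ret, weight=0.4)
print(p)
```

With `weight=0` the strategy reduces to pure generation, so the unweighted baseline in the comparison corresponds to fixing the retrieval contribution rather than adapting it.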