With the rapid development of medical informatization, hospitals have generated and accumulated large amounts of medical data, such as X-ray images, CT images, MRI images, ultrasound images, and the corresponding diagnostic reports. These data belong to different modalities yet are semantically related to each other, for example medical images of the same pathology and their corresponding diagnostic reports. Cross-modal retrieval between medical images and diagnostic reports has therefore become an important research topic in medical cross-media intelligence. Cross-modal retrieval between chest X-ray images and diagnostic reports means retrieving the corresponding diagnostic reports (or chest X-ray images) from chest X-ray images (or diagnostic reports) to provide a reference for doctors' diagnoses, which greatly reduces the workload of radiologists and improves diagnostic efficiency. Its main challenges are the "semantic gap" between high-level semantics and low-level representations within the same modality and the "heterogeneous gap" between representations with the same semantics across different modalities. Existing methods for retrieving chest X-ray images and diagnostic reports focus on global information alignment and ignore the fine-grained semantic associations between images and reports, which results in low retrieval accuracy and poor matching. In this thesis, we propose two end-to-end deep-learning methods for cross-modal retrieval between chest X-ray images and diagnostic reports, aiming to bridge the fine-grained semantic gap and the heterogeneous gap between the two modalities in medical scenarios. The main work of this thesis is as follows:

1. Existing methods for chest X-ray images and diagnostic reports focus on global information alignment and ignore the fine-grained semantic associations between the two modalities, resulting in low retrieval accuracy and a poor degree of matching. We therefore propose a twin-tower cross-modal retrieval method for Chest X-ray images and Diagnostic reports (CDTCR). Specifically, for fine-grained semantic representation, an image encoder built on a residual network learns fine-grained image features, and a Transformer-based BERT model learns the fine-grained semantic features of the diagnostic report. To establish fine-grained semantic associations, an information alignment strategy at two granularities, global image-to-sentence and local region-to-phrase, is designed to remedy the insufficient fine-grained semantic association between the modalities. Experimental results on the large-scale medical dataset MIMIC-CXR show that CDTCR achieves higher retrieval accuracy and better interpretability than existing cross-modal retrieval methods.
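To make the twin-tower design concrete, the following is a minimal sketch in PyTorch. The specific backbone (ResNet-50), embedding dimension, pooling choices, and the InfoNCE-style global alignment loss are illustrative assumptions, not the exact configuration used in the thesis; the local region-to-phrase loss would additionally pair the region features with phrase-level token groups and is omitted here.

```python
# Hypothetical sketch of a twin-tower image/report encoder with a global
# image-to-sentence alignment loss. Assumed components: ResNet-50 image tower,
# BERT text tower, 256-d shared embedding space, symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from transformers import BertModel

class TwinTowerRetrieval(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Image tower: ResNet backbone; keep the spatial grid for local (region) features.
        backbone = resnet50(weights=None)
        self.image_tower = nn.Sequential(*list(backbone.children())[:-2])  # B x 2048 x H x W
        self.image_proj = nn.Linear(2048, embed_dim)
        # Text tower: BERT; token states give local (phrase-level) features.
        self.text_tower = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.text_tower.config.hidden_size, embed_dim)

    def forward(self, images, input_ids, attention_mask):
        # Local region features plus a global image feature (mean pool over regions).
        grid = self.image_tower(images)                       # B x 2048 x H x W
        regions = grid.flatten(2).transpose(1, 2)             # B x (H*W) x 2048
        regions = F.normalize(self.image_proj(regions), dim=-1)
        img_global = F.normalize(regions.mean(dim=1), dim=-1)

        # Local token/phrase features plus a global report feature ([CLS] token).
        out = self.text_tower(input_ids=input_ids, attention_mask=attention_mask)
        tokens = F.normalize(self.text_proj(out.last_hidden_state), dim=-1)
        txt_global = tokens[:, 0]
        return img_global, regions, txt_global, tokens

def global_alignment_loss(img_global, txt_global, temperature=0.07):
    # Global image-to-sentence alignment as a symmetric contrastive loss
    # (an assumed instantiation of the global alignment objective).
    logits = img_global @ txt_global.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```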
2. Cross-modal retrieval methods between chest X-ray images and radiology reports also tend to focus only on global representation alignment, neglecting the semantic associations of different granularities between modalities and the multi-label semantic information within each modality. To address this, we propose a Cross-modal Retrieval method based on Cross-attention and Category supervision for Chest X-ray images and Radiology reports (3CRCR). First, cross-attention is used to learn local fine-grained semantic associations and to mine the deep semantic information in each modality, so that the model focuses on learning the local region representations corresponding to the pathology. Then, a semantic interaction alignment strategy at different scales is designed to match the semantic associations of different granularities between the chest X-ray image and radiology report modalities. Finally, multi-class labels are used as supervisory information to constrain the learning of semantic representations of different granularities across modalities, yielding multi-level fine-grained semantic representations for each modality. Experimental results show that, on the large-scale medical dataset MIMIC-CXR, the proposed 3CRCR significantly improves the mean average retrieval precision (mAP) over existing cross-modal retrieval methods.
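The sketch below illustrates the two ingredients named in this contribution, cross-attention between local features of the two modalities and a multi-label (category-supervised) head. The attention wiring, the weight sharing across directions, the pooling, and the 14-label multi-hot target (a CheXpert-style label set) are assumptions made for brevity; the thesis's exact formulation may differ.

```python
# Hypothetical sketch of cross-attention plus category supervision.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionBlock(nn.Module):
    """Lets one modality's local features attend to the other's, so region
    features are refined by report phrases and vice versa."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)

class CategorySupervisedHead(nn.Module):
    """Multi-label classifier applied to pooled features of either modality,
    so the pathology labels constrain both representations."""
    def __init__(self, dim=256, num_labels=14):  # 14 findings is an assumption
        super().__init__()
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, pooled, targets):
        # targets: B x num_labels multi-hot vector of pathology labels
        return F.binary_cross_entropy_with_logits(self.classifier(pooled), targets)

# Usage sketch: refine each modality's local features with the other's, then
# add the category-supervision loss to the cross-modal alignment losses.
if __name__ == "__main__":
    regions = torch.randn(2, 49, 256)   # image region features
    tokens = torch.randn(2, 32, 256)    # report token/phrase features
    labels = torch.randint(0, 2, (2, 14)).float()

    cross_attn = CrossAttentionBlock()   # shared weights for both directions, purely for brevity
    cls_head = CategorySupervisedHead()

    refined_regions = cross_attn(regions, tokens)   # image attends to report
    refined_tokens = cross_attn(tokens, regions)    # report attends to image
    loss = cls_head(refined_regions.mean(dim=1), labels) + \
           cls_head(refined_tokens.mean(dim=1), labels)
```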