Benefiting from the continuous growth of data and ongoing innovation in artificial intelligence algorithms, automatic report generation for medical images has attracted increasing interest in recent years. However, identifying a disease and predicting its size, location, and other medically descriptive patterns is critical for generating high-quality reports, and this remains challenging. Although previous methods focus on generating readable reports, a "semantic gap" persists between the visual and semantic features of images. This paper focuses on current mainstream deep learning methods for medical image report generation and, using endoscopic medical image data, proposes multi-modal report generation algorithms that approach the task from different angles. The research content of this paper is as follows:

1. To address the "semantic gap" between visual content and semantics, a visual-semantic mutual attention model for endoscopic image report generation is proposed. From the perspective of understanding visual content and visual semantics, a mutual attention mechanism is used to strengthen the correlation between the visual content and the semantics of an image. First, the visual features of the image are extracted with a pre-trained convolutional neural network, and a multi-label classifier extracts a set of visual words from the image as its semantic attribute features. Then, a mutual attention module integrates the visual features and the semantic attribute features. Finally, an LSTM decodes step by step to generate medical reports with coherently aligned visual and semantic information, and the aligned visual-text features guide the text generation process. By mining visually related semantic attributes, the model makes better use of semantic information and is helped to generate better sentences. The proposed method is validated on an endoscopic image dataset, and the experimental results show that it can effectively and accurately generate diagnosis reports whose conclusions are close to those of the traditional manual method.

2. To address the problems that the text generated by existing models is monotonous and that descriptions of lesions are omitted, an endoscopic image report generation model based on multiple features and a bidirectional GRU structure is proposed. Because report text is generally correlated with its surrounding context, this paper uses a bidirectional GRU structure to take the contextual connections of the sequence into account, and fuses image features, sequence features, and word features to generate a more comprehensive report description. During image feature expression, the forward and backward correlations within the report are effectively fused, which helps improve the visual-semantic expression of medical images. The bidirectional GRU unit also alleviates the information loss caused by modeling with a single unidirectional decoder. The proposed method is validated on the endoscopic image dataset, and the experimental results show that it accurately generates diagnostic reports that are more comprehensive and can accurately locate and describe the lesion area.

Finally, this paper briefly summarizes the above research content, outlines prospects for extending and deepening this work, and proposes possible research directions and ideas.
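The mutual-attention fusion described in the first contribution can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the module names, dimensions, and the mean-pooled fusion are assumptions; the pre-trained CNN features and attribute embeddings are stubbed with random tensors.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Hypothetical sketch of visual-semantic mutual attention: visual region
    features attend over semantic attribute embeddings and vice versa, and the
    two aligned views are fused into one guidance vector for the decoder."""
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, visual, semantic):
        # visual: (B, N, D) region features from a pre-trained CNN
        # semantic: (B, K, D) embeddings of the predicted visual words
        attn_vs = torch.softmax(visual @ semantic.transpose(1, 2) * self.scale, dim=-1)
        attn_sv = torch.softmax(semantic @ visual.transpose(1, 2) * self.scale, dim=-1)
        visual_ctx = attn_vs @ semantic   # semantics aligned to each region
        semantic_ctx = attn_sv @ visual   # regions aligned to each attribute
        # fuse by mean-pooling both aligned views (an assumed fusion choice)
        return torch.cat([visual_ctx.mean(1), semantic_ctx.mean(1)], dim=-1)

B, N, K, D, vocab = 2, 49, 10, 256, 1000
fusion = MutualAttention(D)
decoder = nn.LSTM(input_size=2 * D, hidden_size=512, batch_first=True)
proj = nn.Linear(512, vocab)

visual = torch.randn(B, N, D)     # stand-in for CNN feature-map regions
semantic = torch.randn(B, K, D)   # stand-in for attribute embeddings
ctx = fusion(visual, semantic)    # (B, 2D) fused guidance vector
T = 5                             # decode T steps, feeding ctx at each step
out, _ = decoder(ctx.unsqueeze(1).expand(B, T, 2 * D))
logits = proj(out)                # (B, T, vocab) per-step word distributions
```

In a full model the decoder input would also include the previous word embedding; here only the fused context is fed, to keep the mutual-attention step in focus.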
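The multi-feature bidirectional GRU decoder from the second contribution can likewise be sketched. Again this is a sketch under assumed dimensions, not the thesis model: the global image feature is repeated at every step and concatenated with word embeddings, and the bidirectional GRU's hidden states play the role of the sequence features, so each output position sees both preceding and following context.

```python
import torch
import torch.nn as nn

class BiGRUReportDecoder(nn.Module):
    """Minimal sketch: fuse image features and word features per step, then
    decode with a bidirectional GRU whose states carry the sequence feature."""
    def __init__(self, img_dim, word_dim, hidden, vocab):
        super().__init__()
        self.embed = nn.Embedding(vocab, word_dim)
        self.gru = nn.GRU(input_size=img_dim + word_dim, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, vocab)  # forward + backward states

    def forward(self, img_feat, tokens):
        # img_feat: (B, img_dim) global CNN feature; tokens: (B, T) word ids
        B, T = tokens.shape
        words = self.embed(tokens)                    # (B, T, word_dim)
        img = img_feat.unsqueeze(1).expand(B, T, -1)  # repeat image per step
        fused = torch.cat([img, words], dim=-1)       # multi-feature fusion
        seq, _ = self.gru(fused)                      # (B, T, 2*hidden)
        return self.proj(seq)                         # (B, T, vocab)

model = BiGRUReportDecoder(img_dim=256, word_dim=128, hidden=256, vocab=1000)
logits = model(torch.randn(2, 256), torch.randint(0, 1000, (2, 7)))
```

Because the backward GRU direction reads future tokens, a decoder like this fits report refinement or training with full target sequences rather than strictly left-to-right autoregressive generation.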