| In adult age estimation, bone age is an important biological age indicator that helps forensic anthropologists infer an individual's chronological age. This information is valuable in criminal investigation, determination of an individual's capacity for criminal responsibility, personal identification, and related areas. Because adults lack obvious aging indicators, adult bone age is relatively difficult to estimate. Some studies have shown that estimating adult bone age from the degree of costal cartilage calcification is feasible. However, most existing work relies on manual film reading: researchers observe the staged calcification of costal cartilage with the naked eye and formulate grading scores for bone age estimation. This demands a high level of professional skill and experience, is easily affected by subjective factors, and is also time-consuming, inconsistent in its evaluation criteria, and low in accuracy. In contrast, deep learning-based approaches can substantially reduce the time required for manual evaluation and improve the precision and consistency of bone age estimation through an automated, standardized process. Against this background, and with the help of a cooperating research group, this paper collects and establishes an adult costal cartilage dataset, introduces the widely used Transformer architecture and a multi-modal fusion strategy into adult costal cartilage bone age estimation, and proposes two efficient and accurate deep neural networks. The main contents and innovations of this paper are as follows: 1. To address the strong subjectivity and low accuracy of traditional methods based on volume rendering images, a network based on high-frequency feature guidance and convolutional position embedding self-attention (MetaFormer with SRM-Guided module and CPosE-Attention, SGCAFormer) is proposed. The network uses a high-frequency feature region-guided module to extract high-frequency information, such as edges and textures, from the calcified areas of costal cartilage as an auxiliary input, which strengthens the network's attention to high-frequency detail. In addition, the network incorporates a 2D convolutional position embedding (Convolutional Position Embedding, CPosE) that is generated dynamically from the local neighborhood of the input and helps the model better capture position information and contextual relationships. Furthermore, an efficient and extensible Scale ReLU activation function is proposed, which both alleviates gradient explosion and reduces the computational cost of the network. SGCAFormer achieves a lower mean absolute error and higher accuracy on the volume rendering image dataset used in this paper. 2. To address the limited modeling and analysis capacity of volume rendering images alone, images obtained by maximum intensity projection are added as a second modality, and a dual-branch cross-modal fusion network (Cross-Modal Fusion Network, CMF-Net) based on multi-modal fusion is proposed. The network uses two sub-branches to process the images of the two modalities and extract modality-specific features, and applies three cross-modal fusion modules to achieve efficient multi-scale cross-modal fusion. The sub-branches use convolution-layer cardinality to keep the number of convolution channels consistent across depths, which balances the roles of the channel and spatial dimensions during feature extraction and improves the feature extraction ability of the model. The cross-modal fusion module combines complementary features and lets multi-modal feature information interact in a compression-recalibration form, which gives it higher learnability and robustness. The depth adjustment strategy reduces network complexity and further improves model accuracy. CMF-Net further reduces the mean absolute error on the multimodal dataset and greatly improves the accuracy of bone age estimation. |
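
The abstract describes the cross-modal fusion module only as working "in the form of compression-recalibration" and gives no further detail. The following is a minimal sketch of one plausible reading, assuming a squeeze-and-excitation-style design: each modality's feature map is compressed to per-channel averages, a small shared bottleneck produces per-channel gates, and the gated feature maps are summed. The function names, the bottleneck layout, and the gated-sum fusion are all hypothetical choices made for illustration; they are not taken from the paper and should not be read as the actual CMF-Net module.

```python
import math
import random


def global_average(feature_map):
    """Compress each channel (a flat list of spatial values) to one scalar."""
    return [sum(channel) / len(channel) for channel in feature_map]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(vector, weights):
    """Bias-free dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def compress_recalibrate_fuse(vr_features, mip_features, w_reduce, w_vr, w_mip):
    """Hypothetical compression-recalibration fusion of two modalities.

    vr_features / mip_features: lists of C channels, each a flat list of
    spatial values (volume rendering and maximum intensity projection).
    w_reduce: H x 2C bottleneck weights; w_vr, w_mip: C x H gate weights.
    """
    # Compression: one descriptor per channel and modality, concatenated.
    descriptor = global_average(vr_features) + global_average(mip_features)
    # Shared bottleneck with ReLU (an assumed design, as in squeeze-excitation).
    hidden = [max(0.0, h) for h in linear(descriptor, w_reduce)]
    # Recalibration: per-channel gates in (0, 1) for each modality.
    gate_vr = [sigmoid(g) for g in linear(hidden, w_vr)]
    gate_mip = [sigmoid(g) for g in linear(hidden, w_mip)]
    # Fusion: gated sum of the two modalities, channel by channel.
    fused = [
        [gv * a + gm * b for a, b in zip(ch_vr, ch_mip)]
        for gv, gm, ch_vr, ch_mip in zip(gate_vr, gate_mip, vr_features, mip_features)
    ]
    return fused, gate_vr, gate_mip


# Toy usage with random features and weights (sizes chosen arbitrarily).
rng = random.Random(0)
C, S, H = 4, 6, 2  # channels, spatial positions, bottleneck width
vr = [[rng.random() for _ in range(S)] for _ in range(C)]
mip = [[rng.random() for _ in range(S)] for _ in range(C)]
w_reduce = [[rng.uniform(-1, 1) for _ in range(2 * C)] for _ in range(H)]
w_vr = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(C)]
w_mip = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(C)]

fused, gate_vr, gate_mip = compress_recalibrate_fuse(vr, mip, w_reduce, w_vr, w_mip)
print(len(fused), len(fused[0]))  # channel count and spatial size are preserved
```

In this sketch the gates depend on both modalities at once, so a channel that is weak in one modality can be down-weighted there while its counterpart in the other modality is kept, which is one way complementary features could be combined. In a trained network the three weight matrices would be learned rather than sampled at random.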