| Autism Spectrum Disorder (ASD) is one of the most common developmental disorders, characterized by impaired social interaction and communication skills as well as stereotyped behaviors. Early diagnosis has so far focused on the analysis of EEG and MRI data, which requires sophisticated medical equipment and a cumbersome data collection process. Based on the differences in appearance characteristics between children with ASD and Typical Development (TD) children, we propose two approaches for the early diagnosis of autism. The first method is based on facial expression recognition and head pose estimation, and a model is proposed for each task. Both models use Conformer, which combines a CNN with a Transformer and can therefore extract both local and global features. The facial expression recognition model takes an RGB image and a local binary pattern (LBP) image as simultaneous inputs and introduces a dynamic weight adjustment module to adjust their weights. After fusion, the features are sent to the network to extract facial expression features. The head pose estimation model uses only RGB images as input, extracts features through the Conformer network, and ends with three fully connected layers that classify the yaw, pitch, and roll angles into seven categories. The features extracted by each model are processed with a cumulative histogram, and the children are finally classified by a long short-term memory (LSTM) network. The second method is based on gaze estimation, for which a gaze estimation model is proposed. The model draws on the ConvNeXt network: using only a convolutional neural network, it can extract global features as well as local features. Compared with the Conformer network, this model requires less computation. To further improve the performance of the model, the Large Kernel Attention (LKA) module from the Visual Attention Network (VAN) is added to
the model, which further enlarges the receptive field. The centers of the eyes are obtained from the facial key points, and the screen coordinates are obtained from the vector relationship. The screen is divided into 12 areas according to these coordinates. To take the temporal information of the video frames into account, a cumulative histogram is used to process the features, and an LSTM network is finally used to classify the children. Both methods are verified on the self-collected ASD video dataset (ACVD), and the experimental results demonstrate the effectiveness of the proposed methods. The method based on facial expression recognition and head pose estimation achieves an accuracy of 97.59%, and the method based on gaze estimation achieves an accuracy of 94.6%. |
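
The gaze-based pipeline's last steps (mapping screen coordinates to one of 12 areas, then building a cumulative histogram over video frames) can be illustrated with a minimal sketch. Several details are assumptions made only for illustration and are not stated in the abstract: the 3 x 4 grid layout, the normalization of each histogram by the number of frames seen so far, and the function names `screen_region` and `cumulative_histograms`. The resulting per-frame feature sequence is the kind of input an LSTM classifier could consume.

```python
import numpy as np


def screen_region(x, y, width, height, rows=3, cols=4):
    """Map a screen coordinate to a region index in [0, rows * cols).

    The 3 x 4 layout is an assumed way of splitting the screen into 12 areas.
    """
    col = int(x / width * cols)
    row = int(y / height * rows)
    # Clamp so that points on or beyond the screen border fall in an edge cell.
    col = min(max(col, 0), cols - 1)
    row = min(max(row, 0), rows - 1)
    return row * cols + col


def cumulative_histograms(labels, num_bins):
    """Return a (T, num_bins) array whose row t is the histogram of labels[0..t].

    Each row is normalized by the number of frames seen so far (an assumption).
    """
    labels = np.asarray(labels, dtype=int)
    one_hot = np.eye(num_bins)[labels]           # (T, num_bins), one 1 per frame
    counts = np.cumsum(one_hot, axis=0)          # running count per bin
    frames_seen = np.arange(1, len(labels) + 1)[:, None]
    return counts / frames_seen


# Hypothetical gaze points on a 1920 x 1080 screen, one per video frame.
points = [(100, 50), (1900, 1000), (100, 50)]
regions = [screen_region(x, y, 1920, 1080) for x, y in points]
features = cumulative_histograms(regions, num_bins=12)  # shape (3, 12)
```

The same `cumulative_histograms` helper would apply to any per-frame categorical prediction, such as expression labels or binned head pose angles, by changing `num_bins`.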