| Human action analysis aims to accurately locate and recognize human actions in videos, analyze action elements, judge the quality of action execution, and give quantitative or qualitative evaluations. It is a hot topic in current computer vision research. With the continuous development of video and network technology, it is widely used in transportation, medical care, action teaching, sports events, and other fields, and has very promising prospects. However, how to accurately predict actions in videos, how to improve the accuracy of action recognition, and how to evaluate action completion with high quality remain urgent open problems. In addition, existing human action analysis methods suffer from problems such as loss of fine-grained information, redundant background information, and body occlusion. This thesis uses computer vision and deep learning to study these problems. The specific work is as follows: (1) Human action segmentation. Human action videos may contain important local fine-grained information, which a feature extraction network can perceive more easily through self-attention. In this thesis, the self-attention mechanism is used to simultaneously segment and identify multiple sub-actions in long videos and to enhance the feature extraction ability of convolution. The convolutional feature maps of each encoder layer are combined with the features generated by self-attention, so that both local fine-grained information and global information are used for a sequence of frames, and temporal structure representations are learned effectively through an interactive self-attention mechanism. An ablation experiment demonstrates the effectiveness of feature extraction with the interactive self-attention mechanism. Compared with the baseline network on the BREAKFAST dataset, the frame-level prediction accuracy (Acc) and the prediction sequence similarity (Edit) increase by 3.3% and 3.7%, respectively. (2) Human action
recognition. Spatio-temporal features and action features are two complementary and key kinds of information for video action recognition. In many cases, modeling is based on scenes or backgrounds, without further analysis of fine-grained action feature representation. This thesis uses a multi-branch structure to extract spatio-temporal features and motion features, samples frames at different intervals, decouples redundant information such as scenes, effectively extracts key features, and fuses the spatio-temporal features and motion features to deeply mine the relationship between them. Ablation experiments demonstrate the effectiveness of motion feature modeling. Compared with the baseline network on the HMDB51 dataset, the action recognition accuracy (Acc) increases by 2.9%. (3) Human action assessment. Subtle differences exist between human action videos, and the relationship between videos can provide accurate and important clues for training and inference. At the same time, not all segments of an action video contribute to the final score, and different sub-actions contribute to it differently. This thesis adopts a sub-action video comparison framework: it learns relative scores through video comparison, regresses the relative score between an input video and an example video used as a reference, and uses sub-action features at different stages to regress the overall score, highlighting both partial and overall differences between videos. An ablation experiment demonstrates the effectiveness of the sub-action contrastive regression. Compared with the baseline network on the MTL-AQA dataset, the correlation coefficient (Corr) of the action evaluation score increases by 3.3%. |
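
The abstract reports frame-level accuracy (Acc) and prediction sequence similarity (Edit) for action segmentation but does not define them. The sketch below assumes the definitions commonly used in the action segmentation literature: Acc is the percentage of correctly labeled frames, and Edit is a Levenshtein distance computed over the sequence of segment labels, normalized to a 0-100 score. The function names and the toy labels ("pour", "stir") are hypothetical, and the thesis may compute these metrics differently.

```python
def frame_accuracy(pred, gt):
    """Percentage of frames whose predicted label matches the ground truth."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(pred, gt))
    return 100.0 * correct / len(gt)


def to_segments(labels):
    """Collapse runs of identical frame labels into one label per segment."""
    segments = []
    for label in labels:
        if not segments or segments[-1] != label:
            segments.append(label)
    return segments


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute) between sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_score(pred, gt):
    """Assumed Edit metric: normalized segment-level edit distance in [0, 100]."""
    p, g = to_segments(pred), to_segments(gt)
    longest = max(len(p), len(g))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(p, g) / longest)


# Toy example with hypothetical labels: two predictions with equal Acc.
gt = ["pour"] * 4 + ["stir"] * 4
shifted = ["pour"] * 3 + ["stir"] * 5               # boundary off by one frame
flicker = ["pour", "stir", "pour", "pour"] + ["stir"] * 4  # over-segmented
```

Under these assumed definitions, `shifted` and `flicker` each mislabel one frame out of eight, so their Acc is identical, while only `flicker` is penalized by Edit because it introduces spurious segments. This is why the two metrics are usually reported together.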