Revolutionary technologies such as the mobile Internet and the intelligent Internet of Things have become important drivers of modern city construction. Fine-grained perception of dynamic, complex scenes provides core technical support for upper-layer applications in many fields, such as location-based services and personalized prediction and recommendation. Multi-modal information can mine scene category attributes from different semantic spaces to sense a user's environment. However, as scene complexity increases, upper-layer application services pose serious challenges to the further development of scene perception. For example, behavioral scenes lack a latent semantic space representation; behavioral scene data of devices across locations (domains) exhibit distribution gaps; environmental features in complex indoor and outdoor scenes are ambiguous; and road information is difficult to capture under dynamic changes of road scenes. This dissertation addresses these challenges of fine-grained, dynamic, complex scene perception based on multi-source signals, carrying out theoretical, model, and application research. Four key technologies are developed in turn: cross-modal scene perception based on latent-space information extraction, cross-domain scene perception based on joint alignment of multi-scale features, complex indoor-outdoor scene perception based on semantic mining of spatio-temporal features, and dynamic road-condition perception based on adaptive fusion of visual semantics. The main contributions of this dissertation are as follows:

(1) A cross-modal scene perception method based on latent-space information extraction is proposed to address the lack of a latent semantic space representation for behavioral scenes. The method explores the latent semantic feature space and proposes the DiamondNet model to extract latent-space information from multi-modal sensing signals across modalities. In DiamondNet, an attention-based graph convolutional subnetwork adaptively mines the implicit relationships between different nodes, realizing cross-modal extraction of latent semantic features. An attention-based feature fusion subnetwork then adaptively calibrates and fuses the multi-modal features, enriching the semantic dimensions of features at different levels of the hierarchy while achieving saliency-differentiated feature enhancement. Experiments on three public human activity scene perception datasets show that the method learns effective latent semantic features and outperforms existing methods in accuracy and F1 score.

(2) A cross-domain scene perception method based on joint alignment of multi-scale features is proposed for the data distribution gap between behavioral scenes of devices across locations (domains). The method first proposes an unsupervised source-domain selection algorithm that selects the most suitable source domain for each target domain for domain knowledge transfer. It then proposes a multi-scale deep transfer learning model, DMSTL, which contains MSSTNet, a generalized feature extractor suited to domain knowledge transfer. By constructing a category-level adaptation module and a domain-level adversarial module, DMSTL jointly optimizes the objectives of minimizing the source-domain classification loss, the inter-domain conditional probability distribution discrepancy, and the inter-domain marginal probability distribution discrepancy. Experiments on cross-location human activity scene perception show that the learned domain-invariant features effectively reduce the divergence between source and target domains, and the method substantially improves accuracy and F1 score over existing methods.

(3) A complex indoor-outdoor scene perception method based on semantic mining of spatio-temporal features is proposed for the ambiguity of environmental features in complex indoor-outdoor scenes. The method analyzes and extracts spatial geometric distribution, time-series, and statistical features from raw GNSS measurements. An ensemble machine learning algorithm based on two-layer stacking is designed to effectively enhance the generalization performance of complex indoor-outdoor scene perception. Multi-Sensor DL, a deep learning model for indoor-outdoor detection based on multi-source sensing signals, is also explored; it extracts multi-source spatio-temporal feature vectors to collaboratively sense different dimensions of environmental information. Experiments on an indoor-outdoor scene perception dataset show that the method effectively alleviates the ambiguity of environmental features in complex indoor-outdoor scenes, reduces the indoor-outdoor switching detection delay to 3 seconds, and substantially outperforms existing methods in accuracy.

(4) A dynamic road scene perception method based on adaptive fusion of visual semantics is proposed for the difficulty of capturing road-condition information under dynamic changes of road scenes. The method proposes a multi-dimensional visual spatio-temporal attention network, MDSTA, which consists of a local spatio-temporal receptive field module, a global spatio-temporal receptive field module, and a lane-based algorithm for extracting static and dynamic vehicle features. By capturing contextual change relationships of road scenes and salient vehicle change features, the perception model gains the ability to compute multi-dimensional semantic changes of road scenes. Experiments on dynamic road scene perception show that the method adaptively captures dynamic road-condition information, and its accuracy and F1 scores on the road scene perception dataset verify its effectiveness.

In summary, this dissertation proposes scene perception models and methods along two dimensions, human activity recognition and human environment perception, and conducts extensive experiments to verify the effectiveness of each of the above research contents, providing effective theoretical support for the wide application of scene perception in upper-layer multi-domain services.
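The attention-based calibration and fusion of multi-modal features described in contribution (1) can be illustrated with a minimal NumPy sketch. This is not the DiamondNet implementation; the scoring vector stands in for a learned attention head, and all names here are hypothetical:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(features, w):
    """Attention-weighted fusion of per-modality feature vectors.

    features: (M, D) array, one D-dimensional embedding per modality.
    w: (D,) scoring vector (a stand-in for a learned attention head).
    Returns a single (D,) fused feature vector.
    """
    scores = features @ w        # one relevance score per modality
    alpha = softmax(scores)      # attention weights, summing to 1
    return alpha @ features      # weighted sum over modalities

rng = np.random.default_rng(0)
# e.g. accelerometer / gyroscope / magnetometer embeddings (hypothetical)
feats = rng.normal(size=(3, 8))
fused = attention_fuse(feats, rng.normal(size=8))
print(fused.shape)  # (8,)
```

In the actual model the weights would be produced by trained subnetworks and applied at multiple levels of the feature hierarchy; the sketch only shows the calibrate-then-fuse pattern.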
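Contribution (2) minimizes inter-domain distribution discrepancies jointly with the classification loss. DMSTL itself uses a domain-level adversarial module; as a simpler illustrative proxy for a marginal-discrepancy term, a linear-kernel maximum mean discrepancy (MMD) estimate can be sketched (this substitutes MMD for the adversarial objective purely for illustration, and all names are hypothetical):

```python
import numpy as np

def mmd_linear(Xs, Xt):
    """Linear-kernel MMD^2 between source and target feature batches:
    the squared distance between the two empirical feature means."""
    delta = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(delta @ delta)

rng = np.random.default_rng(1)
src = rng.normal(loc=0.0, size=(200, 16))  # source-domain features
tgt = rng.normal(loc=1.0, size=(200, 16))  # shifted target-domain features
# Identical distributions give (near-)zero discrepancy; the shifted
# target domain produces a large one, which a transfer model would
# drive down alongside the classification loss.
print(mmd_linear(src, src), mmd_linear(src, tgt))
```

A joint objective of the kind the abstract describes would then take the form classification loss plus weighted marginal and conditional discrepancy terms.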