Spatio-Temporal Data Mining In Imbalanced Data Distribution Scenario

Posted on: 2024-04-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P K Wang
Full Text: PDF
GTID: 1528306932957199
Subject: Data Science (Computer Science and Technology)
Abstract/Summary:
Spatio-temporal data records changes in the world and is closely tied to human life. Mining massive spatio-temporal data in depth can help people continuously improve how they live. Deep neural network based spatio-temporal data mining algorithms have recently achieved breakthroughs and now play a crucial role in fields such as smart transportation, earth sciences, and epidemiology, thereby promoting social progress. However, deep learning based spatio-temporal data mining models usually require a large amount of high-quality labeled data for training. First, taking spatio-temporal data prediction tasks as an example, the training data would ideally be uniformly distributed over the spatial or temporal domain. For instance, different areas of the same city would have similar sample sizes, and different time periods at the same intersection would have similar traffic flows. In practice, however, the available data is often sparse in the spatial or temporal domain. Second, taking spatio-temporal data classification tasks as an example, the training data would ideally have a uniform sample size across classes. For instance, different types of accidents would occur at roughly similar frequencies. In practice, however, the available data often follows a long-tailed distribution. This data distribution imbalance is therefore widespread in existing spatio-temporal data mining tasks, and it can prevent typical spatio-temporal data mining models from generalizing well in real scenarios. Improving the performance of spatio-temporal data mining algorithms under imbalanced data distributions has thus become an urgent issue in this field.

This dissertation focuses on imbalanced data distribution in spatio-temporal data mining, with the aim of improving the generalization performance of spatio-temporal data mining models in practical
scenarios. Specifically, it addresses data sparsity in spatio-temporal data prediction tasks and long-tailed data in spatio-temporal data classification tasks, and studies spatio-temporal data mining algorithms under imbalanced data distribution. The main research content and contributions of this dissertation are as follows:

1. For the scenario of sparse and multi-source data distribution in spatio-temporal data prediction, this dissertation proposes, at the model level, a multimodal fusion approach for sparse spatio-temporal data prediction. Specifically, we analyze the spatial sparsity and multi-source nature of spatio-temporal data and propose a spatio-temporal multimodal fusion network for multi-source sparse spatio-temporal data scenarios. Because existing methods cannot comprehensively model multi-source sparse spatio-temporal data, this network approaches the modeling problem from the perspective of modal complementarity. It uses a hierarchical feature fusion approach to deeply fuse all single-modal features, alleviating modal sparsity through semantic complementarity between modalities. To further ensure that the features extracted in sparse data scenarios fully represent the true data distribution, we design two plug-and-play modules for this multimodal fusion model: a self-correcting transformer module and a unified feature interaction module. We evaluate the proposed spatio-temporal multimodal fusion network and the plug-and-play modules on two real housing transaction datasets from New York City and Beijing. Extensive experimental results show that, compared to the best baseline, the spatio-temporal multimodal fusion network achieves 20% and 25% improvements in RMSE and MAPE, respectively, and that the plug-and-play modules further improve the generalization performance of the fusion network.

2. For the scenario where data is permanently missing in
spatio-temporal data prediction, this dissertation proposes, at the data level, a diffusion generation strategy for sparse spatio-temporal data prediction. Specifically, we take urban traffic as a typical underlying scenario in spatio-temporal data mining and, based on a data similarity hypothesis, propose using taxi data that is highly similar to intersection monitoring data as third-party data to construct and train a spatio-temporal diffusion network for permanently missing spatio-temporal data. To address the low quality of existing completion methods, this network uses diffusion-based generation to complete taxi traffic data under simulated obstruction. Furthermore, we transfer the noise predictor of this model to sparsely distributed monitoring data, achieving accurate completion of traffic flow data at intersections without deployed monitoring devices. Experimental results on two real-world datasets show that the proposed spatio-temporal diffusion network can infer traffic flow data at intersections without deployed monitoring devices, effectively overcoming the permanent spatio-temporal data loss caused by extremely sparse distribution, and that it achieves a stable 6% improvement in inference accuracy.

3. For the scenario of long-tailed data distribution in spatio-temporal data classification, this dissertation proposes, at the model level, a contrastive learning strategy for long-tailed spatio-temporal data classification. This is among the early works to focus on long-tailed spatio-temporal data classification. Specifically, with the goal of effective long-tailed spatio-temporal classification, we first analyze the similarities and differences between long-tailed spatio-temporal data classification and conventional long-tailed recognition, and then propose a jointly trained feature space rebalancing strategy from both the representation perspective and the data perspective. In this strategy, we design a balanced contrastive learning module
to learn a more balanced feature space and propose adaptive temporal augmentation to rebalance the latent feature space along the temporal dimension. We construct three derived long-tailed datasets for the long-tailed spatio-temporal data classification problem and conduct in-depth experiments on them, demonstrating that the proposed method outperforms the other baselines, with a performance improvement of up to 8%.

4. For the scenario of long-tailed data distribution in spatio-temporal data classification, this dissertation proposes a general data augmentation method for long-tailed spatio-temporal data classification. This study offers a first theoretical analysis of the gains brought by traditional augmentation strategies for long-tailed learning. Specifically, we first sample spatio-temporal data as ordinary samples and observe that data augmentation can further aggravate the imbalance of the long-tailed distribution. Motivated by this, we propose dynamic optional data augmentation based on curriculum learning, which lets the model dynamically assign augmentation types to each class during training and adjust them based on feedback from each training epoch. Extensive experiments on multiple long-tailed recognition benchmarks show that dynamic optional data augmentation achieves comprehensive accuracy improvements on three standard datasets while overcoming the imbalance caused by data augmentation. The study also demonstrates the flexibility and generality of dynamic optional data augmentation.

By thoroughly investigating the above scenarios, this dissertation achieves spatio-temporal data mining under imbalanced data distribution, effectively overcoming the obstacles that data sparsity and long-tailed distributions pose to model performance, and improving the generalization performance of deep spatio-temporal data mining models.
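
The abstract reports its prediction results as relative improvements in RMSE and MAPE over a baseline. As a minimal sketch of how such figures are typically computed, the following Python snippet uses the standard definitions of the two metrics. The function names and the toy values are illustrative assumptions and are not taken from the dissertation.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes no true value is zero, since each error is divided by it.
    """
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


def relative_improvement(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Toy values, invented for illustration only (not results from the dissertation).
y_true = [100.0, 200.0, 400.0]
baseline_pred = [110.0, 180.0, 440.0]
model_pred = [108.0, 184.0, 432.0]

rmse_gain = relative_improvement(rmse(y_true, baseline_pred), rmse(y_true, model_pred))
mape_gain = relative_improvement(mape(y_true, baseline_pred), mape(y_true, model_pred))
print(f"RMSE improvement: {rmse_gain:.1f}%, MAPE improvement: {mape_gain:.1f}%")
```

In this toy case every model error is 0.8 times the corresponding baseline error, so both metrics improve by about 20%. On real data the two improvements generally differ, because RMSE weights large errors more heavily while MAPE weights errors relative to the true value.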
Keywords/Search Tags:Imbalanced data distribution, Spatio-temporal data mining, Multimodal fusion, Diffusion model, Deep long-tailed learning, Data augmentation