
Research On The Theory And Applied Technology Of Visual Compact Representation

Posted on: 2024-10-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X S Zhu
Full Text: PDF
GTID: 1528307373971049
Subject: Computer Science and Technology
Abstract/Summary:
Multimedia content makes up the vast majority of information in the digital world and serves as an indispensable data carrier. Massive and diverse visual information in particular requires high-quality, efficient methods for perception and understanding, which enable intelligent recognition and support practical tasks such as recommendation and retrieval. This in turn supports the development of artificial intelligence to improve productivity and quality of life. The core of intelligent perception and understanding in computer vision lies in learning good representations from visual data, and compactness is a beneficial property of a good representation. Visual compact representations are therefore a critical approach to improving the quality of multimedia understanding. Moreover, storage and computational costs are major bottlenecks for massive multimedia data, and compact representation methods can effectively relieve them. They therefore play a key role in promoting the goal of carbon neutrality.

The quality of a visual compact representation depends on both the accuracy of the visual representation and the effectiveness of the compression algorithm. To achieve both, two core scientific problems must be addressed: (1) key information preservation and compression; (2) discrete result approximation and optimization. Current research on visual compact representation has been boosted by the rapid development of deep learning and achieves acceptable performance across various tasks. However, these two scientific problems remain underexplored. Training objectives are designed empirically and heuristically, which leads to biased results. Consequently, existing models fall short of expectations when tested in large-scale, complex application scenarios. This dissertation therefore studies several visual compact representation methods with respect to these scientific problems, proceeding from the specific to the general. It analyzes and improves visual hashing, vector quantization, and multi-codebook quantization, and applies them effectively to various applications, leading to the following innovative contributions:

(1) In the field of visual hashing, this dissertation introduces a multivariate Bernoulli distribution hashing model to address the biased optimization objectives and gradient calculations of existing methods. First, it unifies the optimization objectives of current methods by formulating them as inter-class distinctness and intra-class compactness, which effectively measure the lower bound of a hashing model's performance in preserving key information. Integrating this formulation into existing models can fundamentally improve their performance. Second, the dissertation proposes a posterior estimation method that optimizes hash codes in the form of a multivariate Bernoulli distribution, which effectively models the correlation between bit values and achieves low-bias optimization. The method has been validated in fast retrieval scenarios with negligible overhead. It can also be easily combined with existing hashing models to boost their performance, demonstrating its significance for downstream applications.

(2) For visual representation quantization, this dissertation presents a multivariate Gaussian mixture model to tackle the inadequate latent modelling of existing methods. Because visual information is complex and different visual regions are redundant with one another, visual representations exhibit strong correlations, which existing methods fail to model effectively. This dissertation uses the means and covariances of a multivariate Gaussian mixture to model these correlations and obtains more accurate estimates for removing redundancy, which benefits the rate-distortion trade-off. Moreover, a cascaded vector quantizer is proposed that significantly improves encoding and decoding efficiency, while its probabilistic design effectively prevents the model from falling into local optima. In image compression, this method outperforms existing algorithms in terms of rate, distortion, and encoding/decoding latency, providing ample evidence of its effectiveness.

(3) For the general multi-codebook quantization paradigm for visual compact representations, this dissertation proposes an end-to-end joint optimization method for the multi-codebook structure to address the complexity that stems from manual design in existing methods. Since multi-codebook quantization is an NP-hard problem, prior works rely on heuristic algorithms and various tricks, which makes large-scale application challenging. This dissertation constructs a deep-learning-based encoder that leverages a data-driven approach to enhance compression and reconstruction capability. It also adopts an iterative gradient estimation strategy to accurately perform discrete gradient optimization on the encoder. In fast retrieval and visual representation reconstruction tasks, this method matches or surpasses current state-of-the-art methods while exhibiting extremely high encoding efficiency.

(4) As multimedia understanding continues to expand into wider application scenarios, there remains considerable room for applying visual compact representations in general open-set scenarios. Open-set scenarios simulate real-world challenges such as continuously updated semantic concepts and varied data sources with significant differences in visual style. Existing methods encounter distribution shift in these scenarios and, because of the nature of compact representations, even slight distribution shifts can lead to significant changes. First, this dissertation devises sampling strategies and optimization objectives that form a cross-aligned contrastive learning training strategy, which enhances generalizability to visual style variations. Second, it builds a second-order codebook for compositional projection to improve generalizability to unknown semantics. On a benchmark organized for visual compact representations in open-set scenarios, the proposed method demonstrates significant performance improvements, which strengthens the practical value of visual compact representations.
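
As a rough illustration of the multi-codebook quantization setting described in contribution (3), the sketch below approximates a vector as the sum of one codeword chosen from each of several codebooks, so that only the chosen indices need to be stored. It uses a greedy residual encoder, which is a simple heuristic of the kind the abstract attributes to prior work, and not the end-to-end learned encoder proposed in the dissertation. All function names, the codebooks, and the input vector are hypothetical and chosen only for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode_greedy(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Greedy residual encoding (illustrative heuristic, an assumption here).

    At each stage, pick the codeword nearest to the current residual,
    record its index, and subtract it from the residual.
    """
    residual = list(x)
    codes: List[int] = []
    for codebook in codebooks:
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(residual, codebook[k]),
        )
        codes.append(best)
        residual = [r - c for r, c in zip(residual, codebook[best])]
    return codes


def decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct a vector as the sum of the selected codewords."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for index, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Toy data: two codebooks with two 2-D codewords each (hypothetical values).
toy_codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],   # coarse codebook
    [[0.0, 0.0], [1.0, -1.0]],  # refinement codebook
]
toy_vector = [5.0, 3.0]

toy_codes = encode_greedy(toy_vector, toy_codebooks)
toy_reconstruction = decode(toy_codes, toy_codebooks)
print(toy_codes, toy_reconstruction)
```

In this toy case the vector is stored as two small integer indices instead of two floats. Greedy selection is not guaranteed to find the best combination of codewords in general, which is consistent with the abstract's remark that the underlying problem is NP-hard.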
Keywords/Search Tags: Visual Understanding, Compact Representation, Discrete Optimization, Metric Learning