
Multimedia Feature Representation With Semantic Prior Constraints

Posted on: 2019-07-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F D Nian
Full Text: PDF
GTID: 1318330542497669
Subject: Computer Science and Technology
Abstract/Summary:
Multimedia computing is a cornerstone of the ultimate goals of advanced artificial intelligence, such as robots with human-like multi-modal awareness, and it has great academic significance and practical application value. The key to multimedia computing is to construct powerful feature representation models for data of different modalities, that is, multimedia feature representation. Conventional hand-crafted feature representation models have limited ability to represent the mid-level structure and high-level semantics of data and cannot bridge the "semantic gap". In recent years, deep learning, which draws on characteristics of the human nervous system and possesses strong nonlinear fitting ability, has emerged as the most promising direction for multimedia feature representation. However, most existing deep feature representation models rely too heavily on big data for fully data-driven learning, neglecting the knowledge that humans have acquired through tens of thousands of years of evolution and survival experience. Few studies address how to use human knowledge as a semantic prior to guide the training of deep feature representation models.

This thesis studies the deep feature representation of multi-modal data such as images, text, and video under semantic prior constraints, together with its applications in multimedia computing tasks. Drawing on research results in the field of advanced artificial intelligence, it investigates how semantic prior constraints can improve the effectiveness of multi-modal feature representation models. Theoretical and applied research are closely integrated: the theoretical work serves practical applications (such as visual key-point localization, automatic video captioning, and multi-modal knowledge analysis), while the applied work guides the direction of the theoretical algorithms. The main achievements and contributions of this thesis are as follows:

1. A novel image deep feature representation learning method based on a visual geometric prior. The method first models the geometric information that humans perceive in an image through a loss function, and then uses this loss as a constraint to guide the training of the image representation model (a minimal sketch of such a loss appears after item 2 below). The resulting features can be used to predict facial key points or to recover the motion of non-rigid objects, and the method significantly improves performance without increasing the complexity of existing models.

2. A video deep feature representation learning method based on a video attribute prior. The method first proposes an effective way to construct a video attribute dictionary, and then an efficient mid-level video representation that summarizes a video sequence in a single image. On this basis, the thesis transforms the complex video attribute representation learning problem into a comparatively tractable image multi-label classification problem, sketched below. Finally, by improving the encoder structure of a sequence-to-sequence model, the attribute-prior video representation is introduced into an automatic video captioning framework, which significantly improves the semantics of the generated sentences.
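To make contribution 1 concrete, the following is a minimal sketch of a geometric-prior loss, not the thesis's actual formulation: a standard coordinate regression term is augmented with a hypothetical pairwise-distance term so that the predicted keypoint layout respects the geometric structure of the target. PyTorch is assumed.

import torch
import torch.nn as nn

def pairwise_distances(pts):
    # pts: (batch, K, 2) keypoint coordinates
    diff = pts.unsqueeze(2) - pts.unsqueeze(1)    # (batch, K, K, 2)
    # Small epsilon keeps the sqrt differentiable on the diagonal.
    return (diff.pow(2).sum(-1) + 1e-8).sqrt()    # (batch, K, K)

def geometric_prior_loss(pred, target, alpha=0.1):
    # Standard coordinate regression term.
    point_loss = nn.functional.mse_loss(pred, target)
    # Geometric constraint: match the pairwise-distance structure of
    # the ground-truth layout, a proxy for perceived geometry.
    geo_loss = nn.functional.mse_loss(
        pairwise_distances(pred), pairwise_distances(target))
    return point_loss + alpha * geo_loss

# Usage with any keypoint regressor producing (batch, K, 2) outputs:
pred = torch.rand(8, 68, 2, requires_grad=True)   # e.g. 68 facial landmarks
target = torch.rand(8, 68, 2)
loss = geometric_prior_loss(pred, target)
loss.backward()

Because the geometric term is only an extra loss, it guides training without adding parameters or inference cost, consistent with the claim that model complexity is unchanged.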
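Similarly, for contribution 2, once a video is summarized as a single image, attribute prediction reduces to multi-label image classification. The sketch below is illustrative only: the tiny backbone, the dictionary size of 300, and the frame-mean summary are placeholders, not the thesis's actual representation.

import torch
import torch.nn as nn

class AttributeClassifier(nn.Module):
    def __init__(self, num_attributes):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64, num_attributes)

    def forward(self, video):                     # video: (batch, T, 3, H, W)
        summary_image = video.mean(dim=1)         # placeholder single-image summary
        return self.head(self.backbone(summary_image))

# One sigmoid output per dictionary attribute, trained with BCE.
model = AttributeClassifier(num_attributes=300)   # hypothetical dictionary size
video = torch.rand(4, 16, 3, 112, 112)
labels = torch.randint(0, 2, (4, 300)).float()
loss = nn.BCEWithLogitsLoss()(model(video), labels)
loss.backward()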
3. A new multi-modal deep feature representation learning method based on cross-modal knowledge association, which learns deep knowledge representations of multi-modal data directly from web data. The thesis first proposes a complete pipeline for automatically mining large-scale structured multi-modal relational datasets from heterogeneous and unstructured web sources. Then, based on the knowledge prior of cross-modal data, it presents a bi-enhanced multi-modal knowledge representation learning model which, combined with alternating cross-modal training, unifies the knowledge representations of the different modalities into a common semantic space driven by the data itself rather than by specific tasks (a sketch of the shared-space idea follows the summary below).

Quantitative and qualitative analysis of extensive experimental results shows that modeling a semantic prior that reflects human knowledge, and using it as a supervisory signal to train deep feature representation models, can significantly improve the representation ability of multi-modal features and thereby advance the related multimedia computing tasks.
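Finally, as an illustration of the shared semantic space in contribution 3: two encoders project visual and textual features into one embedding space and are trained with a symmetric ranking loss so that matched cross-modal pairs lie closer than mismatched ones. This is a generic sketch under assumed feature dimensions, not the thesis's bi-enhanced model or its alternating training scheme.

import torch
import torch.nn as nn

class CrossModalEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, dim=256):
        super().__init__()
        self.img_enc = nn.Linear(img_dim, dim)    # placeholder encoders
        self.txt_enc = nn.Linear(txt_dim, dim)

    def forward(self, img_feat, txt_feat):
        img = nn.functional.normalize(self.img_enc(img_feat), dim=-1)
        txt = nn.functional.normalize(self.txt_enc(txt_feat), dim=-1)
        return img, txt

def symmetric_ranking_loss(img, txt, margin=0.2):
    sim = img @ txt.t()                           # cosine similarities
    pos = sim.diag().unsqueeze(1)                 # matched-pair scores
    # Hinge on both retrieval directions (image->text and text->image).
    cost_i2t = (margin + sim - pos).clamp(min=0)
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_i2t[~mask].mean() + cost_t2i[~mask].mean()

model = CrossModalEmbedding()
img, txt = model(torch.rand(32, 2048), torch.rand(32, 300))
loss = symmetric_ranking_loss(img, txt)
loss.backward()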
Keywords/Search Tags:multimedia computing, deep learning, semantic modeling, feature representation