Multimodal Multimedia Data Analysis And Key Technology Research

Posted on: 2015-12-25    Degree: Doctor    Type: Dissertation
Country: China    Candidate: W Z Nie    Full Text: PDF
GTID: 1108330485991729    Subject: Signal and Information Processing
Abstract/Summary:
With the development of information technology in recent years, media data has evolved from plain text into multimodal data with vivid expression and rich content, such as images, video, and audio. With the growth of the internet and the spread of digital capture devices, multimodal data is growing at a massive rate, and its storage, transport, application, and analysis pose a serious challenge. Over the past twenty years, a great deal of time and money has been spent on multimedia data analysis and understanding in order to fully exploit digital information in daily life. For example, Google built a text retrieval system early on that retrieves useful information for a user's query words; Zhihu (http://www.zhihu.com/) focuses on question answering (QA) and provides accurate answers to users in a short time; and Baidu offers an image retrieval system based on image features. However, these applications focus on the analysis of single-modality data. Faced with large volumes of multimodal data, these traditional analysis and retrieval methods no longer meet users' requirements, so more and more researchers are turning their attention to multimodal data analysis. In this dissertation, we conduct in-depth research on three key problems in the analysis of multimodal data.

First problem: image semantic extraction, the key technology for mining the mapping relation between image data and textual data. With the development of the mobile internet, images are often generated together with relevant geographic information, textual information, and other associated modalities. Exploiting the relations among multimodal data to solve the image semantic extraction problem is therefore a key research direction. Using the geographic information attached to images, we propose a novel cross-domain learning algorithm that effectively models the mapping between image data and textual data. First, we collect a large amount of textual information according to the location of each image, which narrows the semantic scope of the image. Second, this textual information is used to collect an image dataset that serves as an auxiliary domain. Finally, labeled images from the source domain and the auxiliary domain are used to train a classifier for each tag, and these classifiers extract semantic information from unlabeled images; a minimal sketch of this per-tag scheme is given below. The experimental results demonstrate the superiority of the proposed method.
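The abstract does not specify the classifier or how the two domains are combined, so the following Python sketch is only an assumption-laden illustration of per-tag annotation with an auxiliary domain: train_tag_classifiers, annotate, the logistic-regression choice, and the aux_weight down-weighting are all hypothetical, not the dissertation's actual implementation.

```python
# Minimal sketch of per-tag image annotation with an auxiliary domain.
# All names and the weighting scheme are illustrative assumptions, not
# the dissertation's actual method.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_tag_classifiers(X_src, Y_src, X_aux, Y_aux, tags, aux_weight=0.3):
    """Train one binary classifier per tag on source + auxiliary images.

    X_src, X_aux : (n, d) feature matrices for the two domains.
    Y_src, Y_aux : (n, len(tags)) binary tag-indicator matrices.
    aux_weight   : down-weights the noisier, web-collected auxiliary domain.
    """
    X = np.vstack([X_src, X_aux])
    w = np.concatenate([np.ones(len(X_src)), aux_weight * np.ones(len(X_aux))])
    classifiers = {}
    for j, tag in enumerate(tags):
        y = np.concatenate([Y_src[:, j], Y_aux[:, j]])
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, y, sample_weight=w)   # auxiliary samples count less
        classifiers[tag] = clf
    return classifiers

def annotate(classifiers, X_unlabeled, threshold=0.5):
    """Assign every tag whose classifier score exceeds the threshold."""
    return [[tag for tag, clf in classifiers.items()
             if clf.predict_proba(x.reshape(1, -1))[0, 1] > threshold]
            for x in X_unlabeled]
```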
Second problem: video semantic extraction, the key technology for mining the mapping relation between video data and textual data. A video can be seen as a sequence of images, so the temporal relations among image-level concepts should be fully exploited when analyzing and mining video data. We propose a new data association method that connects video concepts to video events via graph matching. First, we apply a part-based model to detect the location of each person in each frame. In this process, an online learning method trains a classifier for each tracked object and keeps updating its training samples, so that each individual classifier stays adapted to the object's recent appearance. Then, the spatial and temporal information of the tracklets is used to construct a graph model that formulates data association as a graph matching problem. Finally, data association generates a trajectory for each tracked object, and these trajectories are used to infer the semantic information of the video.

Third problem: multimodal data semantic extraction, a key technology for the integrated application of information. Faced with massive multimodal media data, the data relevant to a user's specific need forms a certain intersection, and a multimodal semantic extraction algorithm built around specific targets (location, person, object, etc.) provides powerful support for generating effective information for the user. Considering the semantic relationships among multimodal data, we propose a novel multimedia semantic extraction model based on location information. First, we collect a large amount of venue multimedia data (images, text, video, location information, etc.) from Foursquare. Then, we exploit the relations among the multimodal data to construct a graph model. Finally, a graph segmentation method is used to extract venue semantics. Experimental results demonstrate the effectiveness of the proposed method.

Based on an analysis of the data characteristics, we propose a series of innovative algorithms that solve practical problems in multimodal data applications. The main contributions of this work are summarized as follows:

- We propose a novel automatic web-image annotation method based on cross-domain learning, which automatically labels the concepts in an image.
- We propose a modified object detection algorithm that effectively handles the occlusion problem during detection.
- We formulate the graph matching problem as a Rayleigh maximum entropy problem and solve this optimization with gradient descent (sketched below).
- We propose a novel venue semantic model that addresses the 'semantic gap' problem on our collected dataset (also sketched below); the final experimental results demonstrate the effectiveness of our method.
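The abstract does not give the exact Rayleigh maximum entropy objective, so the sketch below substitutes the closely related, standard relaxation of graph matching: projected gradient ascent on the quadratic score x^T M x over the unit sphere, a Rayleigh-quotient-style problem. The affinity matrix M and all names are illustrative assumptions, not the dissertation's formulation.

```python
# Illustrative sketch only: projected gradient ascent on the quadratic
# objective x^T M x used in relaxed graph matching, where M is a pairwise
# affinity matrix between candidate tracklet-to-tracklet assignments.
# This is NOT the dissertation's exact formulation.
import numpy as np

def match_by_gradient_ascent(M, steps=200, lr=0.05):
    """Maximize x^T M x over the unit sphere; M should be symmetric
    and non-negative."""
    n = M.shape[0]
    x = np.full(n, 1.0 / np.sqrt(n))         # uniform starting point
    for _ in range(steps):
        x = x + lr * (2.0 * M @ x)           # gradient of x^T M x
        x = np.maximum(x, 0.0)               # keep assignment scores non-negative
        x = x / (np.linalg.norm(x) + 1e-12)  # project back to the unit sphere
    return x                                 # soft scores; discretize afterwards

# Tiny usage example with a random symmetric affinity matrix.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
M = (A + A.T) / 2.0
print(match_by_gradient_ascent(M).round(3))
```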
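Similarly, for the venue semantic model, the following is only a minimal sketch under stated assumptions: items collected for a venue become graph nodes, a hypothetical similarity function weights the edges, and off-the-shelf modularity-based community detection from networkx stands in for the dissertation's graph segmentation method.

```python
# Rough illustration of the venue-semantics pipeline. The similarity
# callable, the data layout, and the community-detection stand-in are all
# invented for this sketch; the dissertation's segmentation may differ.
import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def venue_semantic_groups(items, similarity, threshold=0.5):
    """items: list of (item_id, modality, feature) tuples for one venue.
    similarity: callable scoring a pair of items in [0, 1]."""
    g = nx.Graph()
    g.add_nodes_from(item_id for item_id, _, _ in items)
    for a, b in itertools.combinations(items, 2):
        s = similarity(a, b)
        if s >= threshold:                   # keep only confident relations
            g.add_edge(a[0], b[0], weight=s)
    # Each community is treated as one candidate venue semantic.
    return [set(c) for c in greedy_modularity_communities(g, weight="weight")]
```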
Keywords/Search Tags: Multimodal, Image Annotation, Video Concept Detection, Multiple Object Tracking, Semantic Extraction