
Multimodal Question Answering Over Structured Data With Ambiguous Entities

Posted on: 2018-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: H D Li
Full Text: PDF
GTID: 2348330512490271
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, we have been witnessing profound changes in the way people satisfy their information needs. For instance, with the ubiquitous 24/7 availability of mobile devices, the number of search engine queries issued from mobile devices has reportedly overtaken the number issued from regular personal computers. One important aspect of this shift is that mobile device usage tends to favor input modalities other than the traditional keyboard and mouse interface, which in previous decades had quite clearly been the most prototypical input devices. Touch interfaces can directly substitute for some of the previous forms of interaction. Yet typing on mobile devices can be cumbersome, especially on the go, and voice recognition is not always practical in noisy environments. Thus, we have become used to seeing shorter emails with "Please excuse brevity"-style disclaimers.

At the same time, mobile devices open up new opportunities. We now have the ability to instantly snap pictures whenever we encounter something interesting in our daily lives. This has already led to simple photo-sharing startups such as Instagram being acquired for a reported $1 billion. Additionally, touch interfaces and stylus devices make it easier to sketch things that we are looking for, enabling a novel but little-explored form of information search. Keeping in mind the popular notion of a picture being worth a thousand words, these new modalities may in fact be more than just an alternative. In some cases, humans may have difficulty formulating a natural language query specific enough to make the answer evident. We conjecture that in such cases, additional multimodal input may aid the user in conveying their search intent to the question answering engine.

In this paper, we consider the task of multimodal question answering (QA), in which a user supplies not just a natural language query but also an image to satisfy their information need. The image can be a photograph or a human-drawn sketch. We focus on questions and answers that can be addressed using structured knowledge repositories. Our system tackles this challenging problem in multiple steps. First, we apply standard linguistic analysis methods to the natural language part of the query; our experiments show that this component alone already gives results comparable to those of previous systems. Subsequently, we draw on a novel algorithm based on optimizing a non-convex objective function with linear constraints, which allows us to jointly capture both linguistic and multimodal constraints in a single optimization problem. The final answers are then retrieved from the knowledge base. Our experiments show that this enables the system to answer even very challenging ambiguous entity queries with high accuracy.
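The abstract only names the shape of the optimization (a non-convex objective under linear constraints), not its concrete form. As a rough, hypothetical illustration of that shape, the following Python sketch scores knowledge-base candidates with two made-up modality scores and couples them through a non-convex product objective over a simplex relaxation; the score vectors, the product coupling, and all names here are assumptions for illustration, not the thesis's actual model:

import numpy as np
from scipy.optimize import minimize

# Hypothetical per-candidate scores for three candidate entities:
# s_lang[i]: linguistic match between the NL query and candidate i
# s_img[i]:  visual similarity between the query image and candidate i
s_lang = np.array([0.9, 0.4, 0.7])
s_img = np.array([0.2, 0.8, 0.6])

def objective(x):
    # The product of the two modality scores is a simple non-convex
    # coupling of linguistic and visual evidence; negated because
    # scipy minimizes.
    return -(s_lang @ x) * (s_img @ x)

n = len(s_lang)
x0 = np.full(n, 1.0 / n)                      # uniform starting relaxation
bounds = [(0.0, 1.0)] * n                     # each weight in [0, 1]
constraints = [{"type": "eq", "fun": lambda x: x.sum() - 1.0}]  # linear simplex constraint

res = minimize(objective, x0, method="SLSQP", bounds=bounds, constraints=constraints)
best = int(np.argmax(res.x))                  # round the relaxation to one entity
print("selected candidate:", best, "weights:", np.round(res.x, 3))

Rounding the relaxed weights to the top-scoring candidate loosely mimics the final answer-retrieval step; the thesis's actual pipeline would operate over structured knowledge-base entities and learned features rather than toy score vectors.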
Keywords/Search Tags: Question Answering, Multimodal, Multimedia Knowledge