
Cross-Modality Semantic Integration and Robust Interpretation of Multimodal User Interactions

Posted on: 2011-08-01
Degree: Ph.D
Type: Thesis
University: The Chinese University of Hong Kong (Hong Kong)
Candidate: Hui, Pui Yu
Full Text: PDF
GTID: 2448390002969629
Subject: Engineering
Abstract/Summary:
Multimodal systems can represent and manipulate semantics from different human communication modalities at different levels of abstraction; multimodal integration is required to combine the semantics from two or more modalities into an interpretable output for further processing. In this work, we develop a framework for automatic cross-modality semantic integration of multimodal user interactions that combine speech and pen gestures. The framework begins by generating partial interpretations for each input event as a ranked list of hypothesized semantics. We devise a cross-modality semantic integration procedure that aligns the hypothesis lists of every speech input event and every pen input event in a multimodal expression. This is achieved with a Viterbi alignment that enforces temporal ordering and semantic compatibility constraints on aligned events. The alignment enables generation of a unimodal paraphrase that is semantically equivalent to the original multimodal expression. Our experiments are based on a multimodal corpus in the navigation domain. Applying the integration procedure to manual transcripts generates correct unimodal paraphrases for around 96% of the multimodal inquiries in the test set; with automatic speech and pen recognition transcripts, however, performance drops to around 53%. To address this, we devise a hypothesis rescoring procedure that evaluates all candidate cross-modality integrations derived from the multiple recognition hypotheses of each modality. The rescoring function incorporates the integration score, the N-best purity of recognized spoken locative references (SLRs), and the distances between the coordinates of recognized pen gestures and their interpreted icons on the map. Cross-modality hypothesis rescoring improves performance, generating correct unimodal paraphrases for over 72% of the multimodal inquiries in the test set.

We have also applied latent semantic modeling (LSM) to the interpretation of multimodal user input consisting of speech and pen gestures. Each modality of a multimodal input carries semantics related to a domain-specific task goal (TG), and each input is manually annotated with a TG based on those semantics. Because multimodal input usually has a simpler syntactic structure and a different ordering of semantic constituents than unimodal input, we propose to use LSM to derive the latent semantics of multimodal inputs. To achieve this, we characterize each cross-modal integration pattern as a 3-tuple multimodal term comprising the SLR, the pen gesture type, and their temporal relation. The term correlation matrix is then decomposed using singular value decomposition (SVD) to derive the latent semantics automatically. TG inference on a disjoint test set based on these latent semantics is correct for 99% of the multimodal inquiries.
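To make the alignment step concrete, the sketch below shows an order-preserving dynamic-programming alignment of speech and pen input events over their ranked hypothesis lists, in the spirit of the Viterbi alignment described above. It is an illustration only: the class names, the compatibility() score, and the toy data are hypothetical and not taken from the thesis.

```python
# Minimal sketch (not the thesis implementation) of aligning ranked hypothesis
# lists of spoken locative references (SLRs) with those of pen gestures.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Hypothesis:
    label: str      # hypothesized semantic label, e.g. a location type
    score: float    # recognizer/interpretation confidence

@dataclass
class InputEvent:
    time: float                # onset time; event lists are assumed time-ordered
    hyps: List[Hypothesis]     # ranked list of hypothesized semantics

def compatibility(slr: Hypothesis, pen: Hypothesis) -> float:
    """Toy semantic-compatibility score: reward matching labels plus confidence."""
    match = 1.0 if slr.label == pen.label else 0.0
    return match + 0.5 * (slr.score + pen.score)

def align(slrs: List[InputEvent],
          pens: List[InputEvent]) -> Tuple[float, List[Tuple[int, int]]]:
    """Order-preserving DP alignment of speech events (SLRs) to pen gestures.

    Temporal ordering is enforced by only allowing pairings that keep both
    time-ordered sequences in order; semantic compatibility drives the score.
    """
    n, m = len(slrs), len(pens)
    NEG = float("-inf")
    best = [[NEG] * (m + 1) for _ in range(n + 1)]     # best[i][j]: first i SLRs vs first j gestures
    back: List[List[Optional[str]]] = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == NEG:
                continue
            if i < n and j < m:                        # pair SLR i with gesture j
                s = best[i][j] + max(compatibility(h, g)
                                     for h in slrs[i].hyps for g in pens[j].hyps)
                if s > best[i + 1][j + 1]:
                    best[i + 1][j + 1], back[i + 1][j + 1] = s, "pair"
            if i < n and best[i][j] > best[i + 1][j]:  # leave SLR i unaligned
                best[i + 1][j], back[i + 1][j] = best[i][j], "skip_slr"
            if j < m and best[i][j] > best[i][j + 1]:  # leave gesture j unaligned
                best[i][j + 1], back[i][j + 1] = best[i][j], "skip_pen"
    pairs, i, j = [], n, m                             # trace back aligned index pairs
    while i > 0 or j > 0:
        step = back[i][j]
        if step == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_slr":
            i -= 1
        else:
            j -= 1
    return best[n][m], list(reversed(pairs))

# Toy usage: two SLRs and two gestures, aligned in temporal order.
slrs = [InputEvent(0.1, [Hypothesis("restaurant", 0.9), Hypothesis("rest_area", 0.3)]),
        InputEvent(1.4, [Hypothesis("hotel", 0.8)])]
pens = [InputEvent(0.3, [Hypothesis("restaurant", 0.7)]),
        InputEvent(1.6, [Hypothesis("hotel", 0.6)])]
score, pairs = align(slrs, pens)    # pairs == [(0, 0), (1, 1)]
```

In the actual system, the alignment score would come from the recognizers' hypothesis scores and the semantic compatibility of the interpreted locations, rather than from this toy compatibility function.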
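For the latent semantic modeling described in the second part of the abstract, the sketch below builds a matrix over 3-tuple multimodal terms (SLR type, pen gesture type, temporal relation) for a toy corpus, decomposes it with SVD, and infers a task goal (TG) for a new input by cosine similarity in the latent space. The term vocabulary, task goals, and folding-in convention are illustrative assumptions, not the thesis's actual corpus or model.

```python
# Minimal LSM sketch under assumed data: term-document matrix over 3-tuple
# multimodal terms, truncated SVD, and nearest-neighbor task-goal inference.
import numpy as np

# Hypothetical vocabulary of 3-tuple multimodal terms:
# (spoken locative reference type, pen gesture type, temporal relation).
TERMS = [
    ("singular_SLR", "point",  "overlap"),
    ("plural_SLR",   "circle", "speech_first"),
    ("route_SLR",    "stroke", "pen_first"),
    ("singular_SLR", "point",  "speech_first"),
]
INDEX = {t: i for i, t in enumerate(TERMS)}

# Toy training corpus: each multimodal input is a bag of terms plus its annotated task goal.
train = [
    ([TERMS[0], TERMS[0]], "locate_single_POI"),
    ([TERMS[1]],           "locate_multiple_POIs"),
    ([TERMS[2], TERMS[3]], "plan_route"),
]

def bag_to_vector(bag):
    """Term-count vector for one multimodal input."""
    v = np.zeros(len(TERMS))
    for t in bag:
        v[INDEX[t]] += 1.0
    return v

# Term-document matrix W (terms x documents), decomposed as W ~= U S Vt.
W = np.column_stack([bag_to_vector(bag) for bag, _ in train])
U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 2                                    # latent dimensions kept
Uk, Sk = U[:, :k], S[:k]
doc_latent = Vt[:k, :].T                 # one latent vector per training input

def infer_tg(bag):
    """Fold a new input into the latent space; return the task goal of the
    nearest training input by cosine similarity."""
    q = bag_to_vector(bag)
    q_latent = (q @ Uk) / Sk             # standard LSI folding-in: q^T U_k S_k^{-1}
    sims = doc_latent @ q_latent / (
        np.linalg.norm(doc_latent, axis=1) * np.linalg.norm(q_latent) + 1e-12)
    return train[int(np.argmax(sims))][1]

print(infer_tg([TERMS[2]]))              # -> plan_route on this toy corpus
```

The thesis builds the latent space from the annotated navigation corpus and reports 99% correct TG inference on the disjoint test set; this sketch only illustrates the SVD mechanics of such a model.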
Keywords/Search Tags: Multimodal, Semantic, Test set, Input