
Research On Visual Semantic Representation And Its Application In Automatic Scene Description Generation System

Posted on: 2012-02-06 Degree: Doctor Type: Dissertation
Country: China Candidate: H P Liu Full Text: PDF
GTID: 1228330374999589 Subject: Signal and Information Processing
Abstract/Summary:
Lexical semantic analysis is an important research topic in Natural Language Processing (NLP). In most existing theories and technologies, semantic representations are based on relations between words or concepts: a word is explained conceptually through other words or through its relations with other words. This type of semantics has been widely applied in fields such as machine translation and question answering. However, it is of little use in other tasks, such as situated human-machine interaction and automatic text description of images, mainly because words in this type of semantics have no connection to perceptual information. To link language and perception, a new trend in NLP research imitates the human language acquisition mechanism, building computational models that learn semantics from various kinds of sensorimotor information. Among these, visual cognition and its relationship with language ability have received special attention; this task is called Vision Grounded Language Acquisition. Language grounding research can extend the original mono-modal language representation to a method based on vision-language association, so that language concepts are associated with sensorimotor information and human-machine interaction under real circumstances becomes possible.

In addition, with the rapid development of computer science and the Internet, multimedia information such as documents, images and videos is increasing dramatically, and the demand for processing this massive unstructured information by computer is becoming more and more urgent. Against this background, this dissertation focuses on the association process, representation methods and acquisition algorithms between visual information and language information. The main work and innovations are summarized as follows:

1. Research on association between visual features and static lexicons

As the first lexicons acquired by humans, nouns and adjectives refer directly to the sensed features of objects in the real world. Their visual information can be contained in static images, so they can be classified as static lexicons. Borrowing the idea of children's language acquisition, this dissertation builds a learning model, ViMac, to automatically associate information between the visual modality and the language modality. ViMac consists of four modules: dual-modal information preprocessing and feature extraction, Hellinger distance based semantic association vector computation, hybrid metric based word clustering, and multi-Hellinger distance based visual feature selection. Through these modules the correspondences between visual features and lexicons can be worked out from the bottom up. In these learning modules, the distance that measures the divergence between distributions of visual features is the key to the learning effect. The learning results of semantic association vector computation and visual feature selection are therefore compared across four distances: one-dimensional Kullback-Leibler distance, one-dimensional Hellinger distance, multi-dimensional Kullback-Leibler distance and multi-dimensional Hellinger distance. Experimental results show that the one-dimensional and multi-dimensional Hellinger distances significantly improve the association results between visual features and static lexicons.

2. Research on semantic representation schemes and language output algorithm

After static lexicons have been associated with visual features, lexical semantics can be represented in various ways in visual subspaces. When ViMac uses the acquired lexical semantics to generate language descriptions for images, the choice of representation affects both the output algorithm and the description performance.
This dissertation therefore proposes three visual semantic representations for static lexicons: a Gaussian based representation, a KNN based representation and a Core-based representation. Among these, the Core-based method draws on the cognitive science finding that human language representations can be divided into a center and an edge. Based on the Core-based representation, a novel compound generation method is proposed. The Compound method can overcome the data sparsity problem, generate compounds that were not learned from the training set, and output corresponding descriptions for new scenes at test time. Output sentences are evaluated automatically with BLEU. Comparative experiments are carried out among the output algorithms based on the three representations. The results show that the compound generation method can generate unseen new words from a predefined word set, overcome the subjective variability in the training data and significantly improve computational efficiency, so its overall performance is far superior to that of the other two algorithms. The experimental results on the Compound method itself also reveal different rules in how humans use core words and compounds.

3. Research on visual semantic representation of dynamic lexicons

As lexicons learned later in human language acquisition, verbs have a certain degree of complexity. Explaining their meanings requires basic lexicons such as nouns and adverbs. The semantics of a verb often refers to an action event, which can be contained in a dynamic video, so verbs can be classified as dynamic lexicons. To address verb complexity, a structure for verb semantic representation based on frame semantics is first defined, consisting of two parts: frame and arguments. In this representation, the frame can be regarded as a cognitive model that organizes situational knowledge related to the linguistic context.
A detailed description can then be realized by selecting among the members categorized under the different arguments. With this representation, a video-based verb semantic acquisition model, ViMac-V, is constructed. The information in both the visual modality and the language modality is more complex in ViMac-V than in ViMac, especially the extraction of frames and arguments from the language modality. ViMac-V first selects the basic classification words using a method based on the co-occurrences between visual features and lexicons. A hybrid word measure based on POS information and minimal edit distance is then used to classify the argument lexicons. After each group of argument lexicons has been acquired, a bigram model is used to extract verb frames. Experimental results show that ViMac-V extracts frames and arguments effectively: in total, 5 groups of frames and 4 groups of arguments (62 lexicons) related to 7 verbs are learned.

4. Research on the association between video information and semantic representation of dynamic lexicons

In ViMac-V, the association of video information with frames and arguments is realized by constructing groups of self-organizing map (SOM) networks. The association between verb frames and the cognitive perspectives highlighted by video information is realized through a frame activation mechanism based on the Learning Vector Quantization algorithm. The argument lexicons dominated by verb frames are categorized in visual spaces through SOM network training, neuron clustering and language concept acquisition. The SOMs connect the distribution of high-dimensional video features with the argument lexicons, and each SOM can be linked by frames into various sub-networks to express different verb semantics. The completed ViMac-V can be deployed on the MT-AR robot platform, which uses a camera and speech output to extend the visual and language abilities of ViMac-V.
Meanwhile, a verb selection algorithm based on the co-occurrences between frames and arguments is proposed to generate natural language descriptions that are closer to the video scenes. Description output experiments show that the visual representation acquired by ViMac-V can be used to generate correct natural language descriptions of small-ball movement events under complex real circumstances.
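
The abstract compares Hellinger and Kullback-Leibler distances between distributions of visual features but does not give the formulas. As a rough illustration only, the sketch below computes both for discrete histograms using their standard textbook definitions. The histogram values, the function names and the idea of comparing a word's feature histogram against a background histogram are assumptions made for this example; they are not taken from ViMac.

```python
import math


def normalize(hist):
    """Scale non-negative counts so they sum to 1."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [h / total for h in hist]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0..1)."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    p, q = normalize(p), normalize(q)
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q); eps guards empty bins in q."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    p, q = normalize(p), normalize(q)
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


if __name__ == "__main__":
    # Hypothetical hue histograms: images described with one word vs. all images.
    word_hist = [40, 5, 3, 2]
    background_hist = [12, 13, 12, 13]
    print("Hellinger:", hellinger(word_hist, background_hist))
    print("KL(word || background):", kl_divergence(word_hist, background_hist))
    print("KL(background || word):", kl_divergence(background_hist, word_hist))
```

Two general properties are visible in this sketch: the Hellinger distance is symmetric and bounded by 1, whereas the Kullback-Leibler divergence is asymmetric and grows without bound when one distribution has mass where the other has almost none. The abstract itself does not say why the Hellinger distance performed better.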
Keywords/Search Tags: Semantic representation, Semantic acquisition, Scene description, Compounds generation, Verb frame activation, Verb arguments categorization