
Research On Image Parsing Database And Computational Mechanism For Object Recognition

Posted on: 2011-05-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Yang
Full Text: PDF
GTID: 1118360305992058
Subject: Control Science and Engineering

Abstract/Summary:
The importance of having an image/video database containing ground truth annotated by humans is widely recognized by the computer vision community. As the first step of database construction, we developed a novel annotation tool, the interactive image parser (IIP), which integrates several functional modules designed for specific tasks. We show that by properly combining these modules, the tool can perform customized annotation tasks that blend many kinds of information. For a scene image, we provide the corresponding visual information at the scene, object, and low-middle levels through a hierarchical image parsing process. To the best of our knowledge, much of the information annotated here has not appeared in previous databases. In addition, an And-Or graph knowledge base is used to organize and summarize the labeled visual knowledge in a uniform way.

Based on this image/video-frame parsing database, we present an image parsing to text generation (I2T) framework that generates natural language descriptions from image and video content. This framework converts the harder problem of content-based image and video retrieval into an easier text search problem. The proposed I2T framework follows three steps. (i) Image parsing: input images/video frames are decomposed into their constituent visual patterns by the image parsing engine IIP, which outputs each scene as a parse graph, in a spirit similar to parsing sentences in natural language. (ii) The parse graphs are converted into a semantic representation in the Web Ontology Language (OWL) format, a formal and unambiguous knowledge representation. (iii) A text generation engine converts the semantic representation into a text report in natural language. The success of this framework relies on two knowledge bases.
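The three I2T steps can be illustrated with a minimal, self-contained sketch. All class and function names below (`ParseNode`, `to_owl_triples`, `generate_text`) and the harbor example are hypothetical, chosen only to show the shape of the pipeline: a hierarchical parse graph is flattened into subject-predicate-object triples (the same shape OWL/RDF semantic representations use), which a toy template-based generator then verbalizes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ParseNode:
    """One node in a parse graph: a scene, object, or part."""
    label: str
    children: List["ParseNode"] = field(default_factory=list)
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (relation, other label)

def to_owl_triples(node: ParseNode) -> List[Tuple[str, str, str]]:
    """Step (ii): flatten a parse graph into (subject, predicate, object) triples."""
    triples = []
    for child in node.children:
        triples.append((node.label, "hasPart", child.label))
        triples.extend(to_owl_triples(child))  # recurse into the hierarchy
    for rel, other in node.relations:
        triples.append((node.label, rel, other))
    return triples

def generate_text(triples: List[Tuple[str, str, str]]) -> str:
    """Step (iii): a toy template-based text generator over the triples."""
    return ". ".join(f"{s} {p} {o}" for s, p, o in triples) + "."

# Step (i) would be the IIP parsing engine; here we hand-build its output
# for an imagined maritime scene.
scene = ParseNode("harbor_scene", children=[
    ParseNode("boat", relations=[("movesToward", "dock")]),
    ParseNode("dock"),
])
print(generate_text(to_owl_triples(scene)))
# harbor_scene hasPart boat. boat movesToward dock. harbor_scene hasPart dock.
```

A real system would emit proper OWL axioms with namespaced URIs rather than bare strings, but the parse graph → triples → text flow is the same.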
The first is a visual knowledge base that provides top-down hypotheses for image parsing and serves as an image ontology for translating parse graphs into semantic representations. The core of the visual knowledge base is an And-Or graph representation. It comprises vocabularies of visual elements, including pixels, primitives, parts, objects, and scenes, together with a stochastic image grammar specifying compositional, spatial, temporal, and functional relations between these elements. The second is a general knowledge base that interconnects several domain-specific ontologies in the form of the Semantic Web; it further enriches the semantic representation of visual content with domain-specific information. Finally, we demonstrate a case study in video surveillance: an end-to-end system that infers video events and generates natural language descriptions of video scenes. Experiments with maritime and urban scenes indicate the feasibility of the proposed approach.

With the support of our visual database, we present a numerical study of the bottom-up and top-down inference processes in hierarchical models, using the And-Or graph as an example. Three inference processes are identified for each node A in an And-Or graph: the α process recognizes node A directly from image features; the β process computes node A by bottom-up binding of its child node(s); and the γ process predicts node A from its parent node(s) in a top-down manner. We isolate and train the α, β, and γ processes separately with task-specific methods. The information contribution of each process is evaluated individually through computer algorithms and human perception testing. Furthermore, we integrate the three processes explicitly for robust inference and propose a greedy pursuit algorithm for object detection/recognition under the Bayesian framework.
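The role of the three channels at a single node can be sketched as follows. This is an illustrative simplification, not the dissertation's exact formulation: it assumes each process yields a log-likelihood-ratio score for the node, and that integration amounts to summing the three scores before thresholding.

```python
def combined_score(alpha: float, beta: float, gamma: float) -> float:
    """Integrate the three inference channels at one And-Or graph node.

    alpha: direct detection score from image features at the node itself
    beta:  bottom-up binding score from the node's detected children
    gamma: top-down prediction score from the node's parent context

    Assumes each score is a log-likelihood ratio, so summation is an
    illustrative stand-in for Bayesian integration.
    """
    return alpha + beta + gamma

# A node that is weak on its own evidence (low alpha), e.g. an occluded or
# small-scale object, can still be inferred when its parts (beta) and its
# context (gamma) carry enough information.
print(combined_score(alpha=-1.0, beta=2.0, gamma=1.5))  # 2.5
```

This also mirrors the experimental findings below: when a channel is uninformative at a given scale or occlusion level, its score contributes little and the other two channels dominate the decision.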
We choose junctions as low-middle level vision data and human faces and cars as high-level vision data. The results show that (i) the effectiveness of the α, β, and γ processes depends on the scale and occlusion conditions of the object instances of interest; (ii) in general, the α process is stronger than the other two for high-level objects, while the β process works much better for low-middle level elements; and (iii) integrating the three processes greatly improves performance.
Keywords/Search Tags: image/video annotation database, And-Or graph, object detection/recognition, image parsing, Semantic Web, natural language generation, feed-forward/feedback computing process, information contribution