Research On Image Sentence Annotation Based On Feature Learning And Tag Refinement

Posted on: 2017-12-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H B Zhang
GTID: 1318330512454959
Subject: Computer software and theory
Abstract/Summary:
Heterogeneous media, including images, text, and video, are widely distributed over the Web, and many implicit semantic correlations exist among them. Thoroughly analyzing and exploiting these correlations can help to better organize, manage, and retrieve multimedia resources. The rapid development of e-commerce in recent years provides a broad platform for exploring the semantic correlations between heterogeneous data: annotating a product image with a paragraph of descriptive text, a task also called image sentence annotation. The text description contains a wealth of semantic information and a coherent syntactic structure, which helps to build more stable cross-media correlations between images and text.

The significance of this research is threefold: 1) it draws simultaneously on research achievements from both the computer vision (CV) field and the natural language processing (NLP) field, so that the methods, models, and algorithms of the two fields are integrated with each other, which can inspire new thinking and steer theoretical research in a productive direction; 2) it helps to change the data management mode of current e-commerce websites: with automatic image sentence annotation, only a few manual revisions to the generated text are needed to publish or update large numbers of products efficiently; 3) it helps to improve the recall of image retrieval, which in turn gives users a better human-computer interaction experience.

Three key problems remain in product-oriented image sentence annotation. First, the image feature learning methods are too simple. Recognizing the key visual characteristics of an image is an important prerequisite for image sentence annotation.
Image features with stronger discriminative and interpretive power must be extracted to describe the key visual content of the images. Second, the coherence of the generated sentences is very poor. Both the readability and the intelligibility of the generated sentences depend mostly on their coherence, so both the semantic correlations and the syntactic mode relations between words must be considered in order to generate coherent combined semantic information (CSI). Third, noise interference is severe in image sentence annotation, and suppressing it can greatly improve annotation performance. Both semantic information noise and syntactic structure noise must therefore be considered when designing a noise suppression strategy.

Based on the above analysis, my research work is presented in the following three aspects:

1. Feature learning is an important prerequisite for image sentence annotation. A new feature learning strategy is presented based on the EMK (Efficient Match Kernels) model and the KDES (Kernel Descriptors) model: 1) a new image feature named SIFT-EMK is first extracted based on the EMK model, and a new image feature named MKF (Multiple Kernel Feature) is then obtained by fusing the shape feature, the texture feature, and the SIFT-EMK feature in a multiple kernel learning model; 2) the Grad-KDES, Shape-KDES, and Color-KDES features are extracted based on the KDES model, and several new image features named MK-KDES-J (J=1...4) are obtained by fusing these KDES features in the multiple kernel learning model. Experimental results show that both the MKF feature and the MK-KDES-1 feature depict the key visual content of the images accurately, which provides an important basis for generating coherent sentences.

2. Both the readability and the intelligibility of the generated sentences depend mostly on their coherence. More importantly, coherence is also a key index for evaluating the annotation model. Two NLG (Natural Language Generation) models are designed to create modified phrases (N-gram word sequences) that describe the image content both accurately and coherently: 1) an SCCM (Semantic Correlation Computing Model) is constructed based on the TF-IDF feature and a visual similarity measure between images. The SCCM summarizes the key words that best describe the image content. An N-gram model is then used to constrain both the semantic correlations and the syntactic mode relations between words, so that modified phrases with rich semantic information and coherent syntactic structure can be generated; 2) the WSBB (Words Sequence Blocks Building) model is also designed. It evaluates the semantic correlations between words using word embeddings and the cosine (COS) metric, and it constrains the syntactic mode relations between words using predefined syntactic mode constraints. The WSBB model outputs many valuable N-gram word sequences (N=1...4). Experimental results show that both the N-gram model and the WSBB model help to generate coherent phrases (word sequences), which are regarded as the core components of the generated sentences.

3. Image sentence annotation involves a great deal of noise, including semantic information noise and syntactic structure noise. A new annotation model based on the TR (Tag Refinement) strategy and the ST (Syntactic Tree) is introduced to deal with the noise: 1) a multilayer TR strategy is designed first: (i) the SCCM is modified by replacing the TF-IDF feature with the AR feature to strengthen the weights of the key words, which is called the first-layer TR; (ii) the WSBB model is modified by setting a semantic correlation threshold y to refine the key words summarized by the SCCM, which is called the second-layer TR; 2) new sparse word embeddings are built from the TC (Term-Context) relations between words (called the TC-based embeddings). The PPMI (Positive Pointwise Mutual Information) metric and the PDI (Positive Distance Information) metric are used to evaluate the semantic correlations and the syntactic mode relations between words, respectively, and the WSBB model generates many valuable N-gram word sequences accordingly; 3) the syntactic tree is designed to merge these N-gram word sequences recursively into a complete sentence; 4) new compact DWE (Distributional Word Embeddings) are created by a deep learning model and replace the sparse TC-based embeddings. With the DWE, the semantic correlations between words are evaluated more accurately. Experimental results show that the multilayer TR strategy effectively suppresses semantic information noise, while the ST effectively suppresses syntactic structure noise. Moreover, introducing the DWE helps to suppress noise during the generation of the N-gram word sequences.

The main innovations of the thesis are as follows:

Innovation 1: An image feature learning strategy is carried out based on the EMK model and the KDES model, and the MKL (multiple kernel learning) model is then used to complete the late fusion of the features. The resulting MKF and MK-KDES-1 features better interpret the key texture and key shape characteristics of the images.

Innovation 2: The SCCM is created to summarize the key words that best describe the visual content of the images. The N-gram model is then applied to constrain the semantic relevance and the syntactic mode between words.
Finally, modified phrases with abundant semantic information and coherent syntactic structure are obtained by using the N-gram model. They are the core components of the sentences.

Innovation 3: A new annotation model integrating the tag refinement strategy and the syntactic tree is presented for image sentence annotation: 1) the multilayer TR strategy is proposed to better refine the key words summarized by the SCCM and to suppress semantic information noise; 2) the WSBB model is designed to generate the corresponding N-gram word sequences, which depict the key visual content of the images; 3) the syntactic tree is designed to combine all the N-gram word sequences recursively. Finally, a complete sentence with abundant semantic information and correct syntactic structure is created and annotated on the image. As a result, syntactic structure noise is greatly suppressed and annotation performance is substantially improved.
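
The semantic-correlation steps summarized in the abstract (sparse term-context embeddings weighted by PPMI, word similarity under the cosine metric, and a correlation threshold used to refine key words) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the toy corpus, the symmetric context window, the threshold value, and the names `ppmi_embeddings`, `cosine`, and `refine_keywords` are all hypothetical, and the PDI metric and the syntactic mode constraints are left out.

```python
import math
from collections import Counter


def ppmi_embeddings(sentences, window=2):
    """Build sparse term-context vectors weighted by PPMI.

    Assumption: a context is any word within a symmetric window of the
    term; the abstract does not say how contexts are actually defined.
    """
    pair_counts = Counter()
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(term, tokens[j])] += 1

    total = sum(pair_counts.values())
    term_counts = Counter()
    context_counts = Counter()
    for (term, ctx), n in pair_counts.items():
        term_counts[term] += n
        context_counts[ctx] += n

    vectors = {}
    for (term, ctx), n in pair_counts.items():
        # PMI = log2(P(term, ctx) / (P(term) * P(ctx))); keep positive values only.
        pmi = math.log2(n * total / (term_counts[term] * context_counts[ctx]))
        if pmi > 0:
            vectors.setdefault(term, {})[ctx] = pmi
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def refine_keywords(seed, candidates, vectors, threshold):
    """Keep the candidates whose similarity to the seed reaches the threshold."""
    seed_vec = vectors.get(seed, {})
    return [w for w in candidates
            if cosine(seed_vec, vectors.get(w, {})) >= threshold]


# Hypothetical product descriptions, already tokenized.
corpus = [
    "red cotton dress with short sleeves".split(),
    "blue cotton shirt with long sleeves".split(),
    "red silk dress with long sleeves".split(),
]
vectors = ppmi_embeddings(corpus)
kept = refine_keywords("dress", ["shirt", "sleeves", "dress"], vectors, threshold=0.3)
```

Raising `threshold` keeps fewer, more closely related key words; the value 0.3 used here is arbitrary and would need tuning on real data.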
Keywords/Search Tags: Image Sentence Annotation, Feature Learning, Tag Refinement, Efficient Match Kernels, Kernel Descriptors, Syntactic Tree