
A Study On Feature Design Algorithms With Application To Image Annotation And Information Extraction

Posted on: 2016-01-08    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z H Jiang    Full Text: PDF
GTID: 1108330503953431    Subject: Computer software and theory
Abstract/Summary:
With the development of the WWW and multimedia technologies, images and texts have become indispensable information carriers. A huge volume of image and text data is generated on the Internet every day, so managing these data effectively has become an urgent task. Faced with such voluminous data, traditional manual labeling is not only time-consuming but also labor-intensive. To address the problems in current image and text understanding algorithms, this thesis proposes several machine learning algorithms that make image and text data management more intelligent. Here, image understanding mainly refers to image classification and automatic image annotation, while text understanding refers to information extraction. Whether for image semantic analysis or for information extraction from text, both tasks can be reduced to pattern recognition: images and texts are merely the media, and low-level features are the language that computers actually understand. In other words, this thesis concentrates on a single topic, namely how to build a better mapping from low-level features to high-level semantics using machine learning algorithms. The main contributions are as follows:

1. A multi-scale low-level feature fusion framework is proposed. The algorithm first applies the traditional bag-of-visual-words (BOW) model at different scales to extract densely sampled visual words. Topics at each scale are then obtained with the pLSA algorithm, and the per-scale topic features are concatenated into a new feature vector. In the experiments, the proposed method outperforms single-scale feature extraction methods.

2. A training data optimization method is proposed. Dense sampling and feature extraction generate a large number of feature points, and every image contains many repeated features and outliers, i.e., redundant and noisy information.
Hence, training SVM (Support Vector Machine) classifiers on all feature points is not only time-consuming but may also deteriorate classification performance. Conversely, selecting a set of representative points as the SVM training data both accelerates training and improves accuracy. For this reason, this thesis proposes to first apply the LVQ (Learning Vector Quantization) technique to optimize the training data, and then use SVM for image annotation. The experiments show that AP-based LVQ outperforms SOM-based LVQ in both the representativeness of the chosen feature points and the classification precision of the trained SVM.

3. A locality-constrained low-rank (LCLR) coding algorithm is proposed for image classification. LCLR exploits the manifold structure of the feature space through joint coding and a locality constraint. Compared with other low-rank paradigms, LCLR uses a locality regularization term instead of the widely used ℓ1 norm. Extensive experiments demonstrate that LCLR outperforms other state-of-the-art algorithms.

4. A fully unsupervised information extraction algorithm is proposed to automatically discover comparable entities in query logs. Applying the algorithm to 1 billion search queries, we built a comparable entity graph containing 630,121 vertices (entities) and 300 million edges. The experiments examine both the algorithm and the resulting graph thoroughly. To the best of our knowledge, this is the largest graph of the comparability relation.

5. In information extraction (IE) research, the text corpus is usually given in advance, so attention focuses on the IE algorithm rather than on the corpus itself. In fact, the quality of the text corpus greatly affects the performance of IE algorithms.
To improve the performance of current IE algorithms, this thesis presents an algorithm for constructing a large-scale, high-quality text corpus. We rank all web pages on the Internet by their knowledge content, and then extract knowledge from the top down. The experiments show that this boosts the performance of both relation-specific information extraction and open information extraction algorithms.
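The abstract does not give pseudocode for the knowledge-based ranking in contribution 5. As a minimal illustration of the random-walk/PageRank idea named in the keywords, the following sketch runs power-iteration PageRank over a toy page graph; the graph, function name, and parameters are illustrative assumptions, not taken from the thesis (the actual algorithm ranks pages by knowledge content, which this toy does not model):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {page: [out-links]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # every node keeps the teleport mass (1 - d) / n
        new = {v: (1.0 - d) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                # split this node's damped rank evenly among its out-links
                share = d * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:
                # dangling page: distribute its rank uniformly
                for u in nodes:
                    new[u] += d * rank[v] / n
        rank = new
    return rank

# Toy web graph: page "a" is linked by every other page,
# so it ends up with the highest rank.
toy = {"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a", "b"]}
ranks = pagerank(toy)
```

Because every node redistributes all of its damped rank (dangling pages included), the ranks stay normalized to 1 across iterations; extracting "from the top down" would then mean processing pages in decreasing order of this score.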
Keywords/Search Tags:open information extraction, relation specific information extraction, random walk algorithm, page rank, image automatic annotation, image classification, bag of visual words, low-rank coding