Automated generation of metadata for mining image and text data

Posted on:2007-04-26

Degree:Ph.D

Type:Dissertation

University:George Mason University

Candidate:Al-Shameri, Faleh Jassem

GTID:1458390005485786

Subject:Computer Science

Abstract/Summary:

Recent years have witnessed an explosion in the amount of digitally stored data, the rate at which data is being generated, and the diversity of disciplines relying on the availability of stored data. Massive datasets are increasingly important in a wide range of applications, including observational sciences, product marketing, and the monitoring and operation of large systems. Massive datasets are collected routinely in a variety of settings: astrophysics, particle physics, genetic sequencing, geographical information systems, weather prediction, medical applications, telecommunications, sensors, government databases, and credit card transactions.

Data mining associated with massive datasets presents a major problem to the serious data miner. Datasets on the scale of terabytes or more preclude any possibility of serious effort by individual humans at manually examining and characterizing the data objects.

My research addresses the challenges of autonomous discovery and triage of the contextually relevant information in massive and complex datasets. The aim is to extract feature vectors from the datasets; these function as digital objects and effectively reduce the volume of the datasets.

I have developed an automated metadata system that scans the database for certain statistically appropriate feature vectors, records them as digital objects, and subsequently augments the metadata with the appropriate digital objects. As a result, the data miner can run a Boolean search on the augmented metadata and quickly reduce the number of objects to be scanned to a much smaller dataset.

Two datasets are considered in my research. The first is text data, and the second is remote sensing data.

The text data used in my research are documents from the Topic Detection and Tracking (TDT) Pilot Corpus, collected by the Linguistic Data Consortium, Philadelphia, PA, and taken directly from CNN and Reuters.
The TDT corpus comprises a set of 15863 documents spanning the period from July 1, 1994 to June 30, 1995.

Four features are extracted from the text dataset: a topics feature, a discriminating words feature, a verbs feature, and a bigrams feature. The four features were attached to each document in the dataset as digital objects, which help in retrieving the information related to each document in the dataset.

The remote sensing images used in my research consisted of 50 gigabytes of data from the Multi-angle Imaging SpectroRadiometer (MISR) instrument, delivered by the Langley DAAC with the help of the MISR team at the Jet Propulsion Laboratory (JPL). The MISR instrument on NASA's Terra satellite provides an excellent prototype database for demonstrating feasibility. The instrument captures radiance measurements, which can be converted to georectified images.

In my research I developed a set of features, part of which is based on the Gray Level Co-occurrence Matrix (GLCM). Adjacent pairs of pixels (assuming 256 gray levels) are used to create a 256 by 256 matrix in which all possible pairs of gray levels are reflected. Images with similar GLCMs are expected to be similar images.

Some of these features are constructed from the GLCM, such as Homogeneity, Contrast, Dissimilarity, Entropy, Angular Second Moment (ASM), and Energy. Other computed features, including Histogram-based Contrast, Alternate Vegetation Index (AVI), and Normalized Difference Vegetation Index (NDVI), are also taken into consideration as part of the features extracted in this research.
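
The GLCM construction and the texture features named in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The abstract does not state the pixel offset, the normalization, or the exact feature formulas, so the sketch assumes horizontally adjacent pairs, a matrix normalized to sum to 1, and the commonly used textbook definitions of each feature. The function names `glcm` and `glcm_features` are hypothetical.

```python
import numpy as np


def glcm(image, levels=256):
    """Normalized co-occurrence matrix for horizontally adjacent pixel pairs.

    Assumption: the neighbor is the pixel one step to the right; the
    abstract only says "adjacent pairs of pixels".
    """
    img = np.asarray(image, dtype=np.intp)
    left = img[:, :-1].ravel()   # first pixel of each pair
    right = img[:, 1:].ravel()   # its right-hand neighbor
    counts = np.zeros((levels, levels), dtype=np.float64)
    # Unbuffered add, so repeated (left, right) pairs are all counted.
    np.add.at(counts, (left, right), 1.0)
    total = counts.sum()
    return counts / total if total > 0 else counts


def glcm_features(p):
    """Texture features from a normalized GLCM (textbook definitions, assumed)."""
    i, j = np.indices(p.shape)
    diff = i - j
    asm = np.sum(p ** 2)
    nonzero = p[p > 0]  # skip empty cells so log() is defined
    return {
        "contrast": float(np.sum(p * diff ** 2)),
        "dissimilarity": float(np.sum(p * np.abs(diff))),
        "homogeneity": float(np.sum(p / (1.0 + diff ** 2))),
        "asm": float(asm),
        "energy": float(np.sqrt(asm)),
        "entropy": float(-np.sum(nonzero * np.log(nonzero))),
    }
```

Calling `glcm_features(glcm(band))` on a two-dimensional array of integer gray levels in the range 0 to 255 returns one small feature dictionary per image. A dictionary of this kind is the sort of compact record that could be attached to an image as metadata.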

Keywords/Search Tags:

Data, Features, Text, Digital objects

Related items

1	Text Data Augmentation Technique Based On Field Features
2	Recognition of free-form three-dimensional objects in range data using global and local features
3	The Design And Implementation Of Text Data Acquisition System Focused On News Field
4	Extracting High-level Multimodal Features
5	Research On Spatio-temporal Data Visualization Based On Spatial Features Mining
6	Research On Video Surveillance
7	Extraction of Text Objects in Image and Video Documents
8	Research On Recognition Of Spam SMS Based On Binary Mixed Features Of Text Content
9	Effective, efficient retrieval in a network of digital information objects
10	Identifying similar objects in social networks and digital libraries