Image analysis and metadata extraction for document search

Posted on:2009-10-13

Degree:Ph.D

Type:Thesis

University:The Pennsylvania State University

Candidate:Lu, Xiaonan

GTID:2448390005454860

Subject:Computer Science

Abstract/Summary:

This thesis addresses two problems related to document search. The first is the analysis and use of images contained within documents for document retrieval applications. The second is metadata generation for scanned scientific documents in web-based archives.

Images carry important non-textual information in scientific documents, yet current digital libraries do not provide users with tools to retrieve documents based on the information available within images. This thesis proposes an integrated document retrieval schema that uses both text and image information. As the initial step in enabling integrated document search, images are categorized into a set of predefined types. Several categories of images have been defined according to their function in scholarly articles. A machine-learning-based approach is proposed that categorizes images using both global features and part features extracted from image content. After categorization, algorithms analyze two common types of images in documents: 2-D plots and diagrams. An algorithm based on thin-line analysis extracts numerical data from 2-D plot images, and an integrated algorithm performs symbol recognition in diagrams. The proposed approach has been evaluated on a test-bed document set collected from the CiteSeer scientific literature digital library and other sources. Experimental evaluation has demonstrated that the algorithms produce acceptable results for real-world use.

Large-scale digitization projects have been conducted at digital libraries to preserve cultural artifacts and to provide permanent access. The growing volume of digitized resources, including scanned books and scientific publications, requires tools and methods that efficiently analyze and manage large collections. This thesis tackles the problem of extracting metadata from scanned volumes of journals. The goal is to extract information describing the internal structure and content of scanned volumes, which is necessary for providing effective content access to digital library users. Methods have been designed for automatically generating volume-level, issue-level, and article-level metadata based on format and text features extracted from scanned volumes. The automatic metadata generation software has been developed and integrated into an operational digital library, the Internet Archive, for real-world usage.

Keywords/Search Tags:

Document search, Metadata, Digital library, Images, Integrated

Related items

1	Library Digital Resource Integration And Practice In New Environmental
2	Research On Digital Library And Automatic Document Classification
3	Research On Distributed Search Engine Of Digital Library Based On P2P
4	XML Technology Is Applied Research In The Digital Library System
5	Application And Implementation Of Digital Library System Of CPC Tianjin Municipal Committee Party School
6	Research And Implement Of OAI-based Integrated Information Retrieve System
7	A Study On The Application Of Metadata In Digital Library
8	Enhancing a domain-specific digital library with metadata based on hierarchical controlled vocabularies
9	Exploration And Application Of Metadata Technology In Building Of Digital Library
10	Application Of Metadata Construction On Digital Library Business