Analysis of document encoding schemes: A general model and retagging toolset

Posted on:1991-05-24

Degree:Ph.D

Type:Dissertation

University:The Ohio State University

Candidate:Barnes, Julie Ann

GTID:1478390017951219

Subject:Computer Science

Abstract/Summary:

Many document encoding schemes and software applications to process electronically encoded documents exist today. The plethora of schemes complicates the development of applications that must access documents in more than one representation. A uniform representation of electronic documents would greatly facilitate software development.

Unfortunately, the retagging of existing electronic documents is difficult, given the current development tools. The fundamental problem of distinguishing the markup from the text strings is complicated by problems such as context-sensitive markup, implicit markup, white space, and the matching of start and end tags. Lexical-analyzer generators such as Lex are based on formal models that are inadequate to handle these problems. Because of this, much of the retagging code must be written by hand.

Based on a generalization of these problems, we develop a new model for textual data objects with embedded markup. The new model for textual data objects is based on the relationships between markup and text strings. The model includes four classes of markup strings: symbol, nonsymbol, implicit segmenting, and explicit segmenting tags.

We propose a uniform representation called a Lexical Intermediate Form (LIF) with the following lexical properties: (1) the tags are easy to distinguish from the text, (2) the tags are unambiguous, and (3) the tags are explicit. The LIF borrows its concrete syntax from the ISO standard SGML, but it is not encumbered with the SGML concept of document-type definitions.

Based on the model and the proposed LIF, we identify two steps in the retagging process and develop software tools that automatically generate the code for each of these steps. Experiences using the toolset are described for six encoding schemes of varying complexity: the Thesaurus Linguae Graecae, the Dictionary of the Old Spanish Language, the Lancaster-Oslo/Bergen Corpus, the Oxford Concordance Program, WATCON-2, and Scribe. Use of the toolset represents a savings in coding effort ranging from 4.3 to 23.2 lines of code generated per line of specification in the toolset. Approximately 98 per cent of the retagging code for these encoding schemes was automatically generated by the toolset.
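
As a rough illustration of what retagging into an explicit, SGML-style form can involve, the following minimal sketch converts a toy encoding scheme into explicit start and end tags. The toy scheme, the tag names `title` and `p`, and the functions `escape_text` and `retag` are all hypothetical: they are not taken from the dissertation, its toolset, or any of the six encoding schemes named in the abstract. The sketch only shows the three lexical properties in miniature: implicit markup (a blank line ending a paragraph) becomes explicit, tags are delimited unambiguously, and text characters that could be mistaken for markup are escaped.

```python
def escape_text(text: str) -> str:
    """Escape characters that could otherwise be mistaken for markup."""
    # '&' must be replaced first so the '&' in '&lt;' is not escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;")


def retag(source: str) -> str:
    """Convert a hypothetical toy scheme to explicit SGML-style tags.

    Assumed toy scheme (not from the dissertation):
      - a line starting with '.ti ' is a title;
      - a blank line implicitly ends the current paragraph.
    """
    out = []
    paragraph = []

    def flush():
        # Emit the pending paragraph with explicit start and end tags.
        if paragraph:
            out.append("<p>" + escape_text(" ".join(paragraph)) + "</p>")
            paragraph.clear()

    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith(".ti "):
            flush()
            out.append("<title>" + escape_text(stripped[4:]) + "</title>")
        elif not stripped:
            flush()  # implicit paragraph boundary becomes an explicit end tag
        else:
            paragraph.append(stripped)
    flush()
    return "\n".join(out)


if __name__ == "__main__":
    print(retag(".ti A < B\nfirst line\nsecond line\n\nnext para"))
```

A real encoding scheme would also have to deal with context-sensitive markup and significant white space, which this sketch deliberately ignores.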

Keywords/Search Tags:

Encoding schemes, Retagging, Toolset, Code, Model, Documents

Related items

1	Comparison of encoding schemes for symbolic model checking of bounded petri nets
2	ChartIndex: A contextual approach to automated standards-based encoding of clinical documents
3	Creation of encoding schemes to reduce markup language-based overhead
4	Research Of Encoding And Decoding For 2D Bar Code
5	Research On Recognition Technology And Encoding Theory Of High-Dimensional Bar Code
6	Research On Secrecy Encoding Schemes Using Polar Code Over Gaussian Wiretap Channels
7	The Design And Implementation Of Uniform Encoding Management System In XXX Locomotive & Rolling Stock Works
8	Research On XML Data Encoding Scheme Of Supporting Updating Data
9	Research And Application Of The 2d Barcode-QR Code Encoding & Decoding System
10	Design And Implementation Of Secure Operating System Testing Toolset