Document boundary determination using structural and lexical analysis

Posted on: 2008-12-24
Degree: M.S.
Type: Thesis
University: University of Nevada, Las Vegas
Candidate: Cartright, Marc-Allen
Full Text: PDF
GTID: 2448390005956340
Subject: Computer Science
Abstract/Summary:
This thesis proposes a method for determining document boundaries in sequentially presented input, using parallel analyses drawn from several facets of structural document understanding and information retrieval. Specifically, the method is intended to serve as a trainable system for determining where one document ends and another begins. Content analysis methods include the Vector Space Model, as well as targeted analysis of content on the margins of document fragments. Structural analysis in this implementation is limited to simple and ubiquitous entities, such as software-generated zones, simple format-specific lines, and the appearance of page numbers. The analysis focuses on the change in similarity between comparisons, with emphasis on the fact that the extremities of documents tend to contain significant structural and lexical changes that can be observed and quantified. We combine the various features using nonlinear approximation (a neural network) and experimentally test the usefulness of the combinations.
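
The lexical part of this idea can be illustrated with a minimal sketch, which is not the thesis implementation. It assumes each fragment is a plain-text page, builds raw term-frequency vectors (a basic Vector Space Model), and flags a candidate boundary wherever the cosine similarity between consecutive pages falls below a fixed threshold. The function names, the threshold value, and the sample pages are all hypothetical; the thesis combines several structural and lexical features with a trained neural network rather than a single hand-set cutoff.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector over lowercased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty page is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def candidate_boundaries(pages, threshold=0.2):
    """Return every index i where pages[i] may begin a new document.

    The threshold is an illustrative assumption, not a value from the thesis.
    """
    vectors = [term_vector(p) for p in pages]
    return [
        i
        for i in range(1, len(pages))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]


# Hypothetical four-page stream: two invoice pages, then two thesis pages.
pages = [
    "invoice total amount due payment invoice",
    "payment terms invoice amount due date",
    "dear committee thesis proposal neural network",
    "neural network training thesis results",
]
print(candidate_boundaries(pages))  # prints [2]
```

A fixed threshold is the simplest possible decision rule. In a trainable system of the kind the abstract describes, the similarity score would be one input feature among several, alongside structural cues such as page numbers.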
Keywords/Search Tags:Using, Document, Structural