
Pattern Extraction And Registration Of Formatted Document Image

Posted on: 2013-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: X M Fang
Full Text: PDF
GTID: 2248330374488701
Subject: Control Science and Engineering
Abstract/Summary:
Formatted documents make up an important share of paper-based information, and recognizing them automatically is essential to office digitization. Recognizing the type of a formatted document is a key step in that process. This thesis focuses on extracting and matching the patterns of formatted documents.

Pattern extraction involves three tasks: region segmentation, region feature extraction, and pattern description. First, because no single segmentation method can segment every kind of document, a type-to-method segmentation strategy is proposed: the document type is determined first, and the segmentation method is chosen accordingly. Form lines are used to classify documents as table type or non-table type. In table-type documents, the table vertices and table cells are extracted from the form lines; in non-table-type documents, the layout is segmented top-down by projection. Next, images, headlines, and text are identified quickly by a hierarchical recognition method based on feature analysis. Finally, a two-stage pattern structure, consisting of a pattern summary and pattern details, is built on the structural features of formatted documents, and XML is used to represent the document pattern structurally.

Existing pattern matching algorithms use exact parameter matching as the matching criterion, which cannot accommodate the individual differences that may exist among formatted documents of the same type. Because document patterns are numeric and vary from one instance to another, a numeric node similarity calculation method and a variable-weight path similarity calculation method are proposed. Because matching against non-matching patterns consumes a great deal of time, a coarse-to-fine pattern matching method is also proposed. First, the summary information is used to compute the root similarity and obtain a candidate pattern set. Second, the pattern details are used to find the exact pattern within the candidate set. Finally, the document pattern is calibrated against the matched pattern to ensure the validity of the extracted pattern.

Experimental results show that the proposed pattern extraction and pattern registration methods effectively extract the patterns of forms, documents, business cards, and other types of formatted documents. The algorithm adapts well to illumination changes and skew in document images, and the proposed pattern matching algorithm is highly fault-tolerant and robust.
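
The coarse-to-fine matching idea described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the `Pattern` and `Region` classes, the relative-difference formula in `numeric_similarity`, the best-match averaging in `detail_similarity`, and the `coarse_threshold` value are all hypothetical. The variable-weight path similarity and the final calibration step are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Region:
    kind: str    # hypothetical labels: "image", "headline", "text"
    box: tuple   # (x, y, w, h), assumed normalised to [0, 1]


@dataclass
class Pattern:
    name: str
    is_table: bool   # summary: table type or non-table type
    regions: list    # details: list of Region


def numeric_similarity(a, b):
    """Tolerant similarity of two non-negative numbers, in [0, 1] (assumed formula)."""
    if a == b:
        return 1.0
    return 1.0 - abs(a - b) / max(a, b)


def root_similarity(doc, ref):
    """Coarse stage: compare summary information only."""
    if doc.is_table != ref.is_table:
        return 0.0
    return numeric_similarity(len(doc.regions), len(ref.regions))


def region_similarity(r, s):
    """Similarity of two regions: same kind required, boxes compared numerically."""
    if r.kind != s.kind:
        return 0.0
    return sum(numeric_similarity(p, q) for p, q in zip(r.box, s.box)) / 4


def detail_similarity(doc, ref):
    """Fine stage: average best-match similarity over the document's regions."""
    if not doc.regions or not ref.regions:
        return 0.0
    total = sum(max(region_similarity(r, s) for s in ref.regions)
                for r in doc.regions)
    return total / max(len(doc.regions), len(ref.regions))


def match(doc, library, coarse_threshold=0.7):
    """Filter by root similarity, then pick the best candidate by details."""
    candidates = [p for p in library
                  if root_similarity(doc, p) >= coarse_threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda p: detail_similarity(doc, p))


# Made-up sample data for illustration only.
library = [
    Pattern("invoice", True, [Region("headline", (0.1, 0.05, 0.8, 0.1)),
                              Region("text", (0.1, 0.2, 0.8, 0.5)),
                              Region("image", (0.7, 0.75, 0.2, 0.2))]),
    Pattern("receipt", True, [Region("headline", (0.1, 0.05, 0.8, 0.1)),
                              Region("text", (0.1, 0.6, 0.8, 0.3)),
                              Region("image", (0.1, 0.2, 0.3, 0.3))]),
    Pattern("card", False, [Region("headline", (0.1, 0.1, 0.5, 0.2)),
                            Region("text", (0.1, 0.4, 0.8, 0.4)),
                            Region("image", (0.7, 0.1, 0.2, 0.2))]),
]
doc = Pattern("scanned", True, [Region("headline", (0.12, 0.05, 0.78, 0.1)),
                                Region("text", (0.1, 0.22, 0.8, 0.48)),
                                Region("image", (0.7, 0.75, 0.2, 0.2))])
best = match(doc, library)
print(best.name if best else "no match")
```

In this sketch the coarse stage discards the non-table pattern without ever comparing regions, which is the time saving the coarse-to-fine approach aims for, and the tolerant numeric comparison lets a slightly shifted scan still match its template.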
Keywords/Search Tags: formatted document image, skew correction, document pattern extraction, document pattern matching