
Research On Layout Segmentation Method For Historical Tibetan Documents

Posted on: 2019-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Zhang
GTID: 2428330593950217
Subject: Computer technology
Abstract/Summary:
In recent years, as the protection and inheritance of traditional history and culture have received growing attention, more and more researchers have turned to the digitization of historical documents. The Tibetan people have a rich traditional culture, which is an indispensable part of the five-thousand-year Chinese civilization. Historical Tibetan documents are a shining pearl among the treasures of that culture. As carriers of ancient Tibetan civilization, they have received extensive attention from historians, linguists, Buddhist scholars and others. Using digital technology to extract, recognize and convert the texts in historical Tibetan documents into digital form therefore has important implications for the study, protection and inheritance of Tibetan history and culture. Layout segmentation is an important basic step in this digitization process. To explore a suitable layout segmentation method for historical Tibetan documents, the main research work of this thesis is as follows.

Firstly, this thesis reviews the research status and development trends of layout segmentation at home and abroad, analyses the characteristics of the layout segmentation methods applied to different document layouts, and identifies an approach suited to historical Tibetan documents. By summarizing the methods that previous researchers proposed for layouts with different characteristics, it draws on the strategies they used for different layout types.

Secondly, our investigation found that the text regions of historical Tibetan document images have a higher density of corner points than non-text regions. We therefore combine this feature with connected component analysis to extract text regions from historical Tibetan documents. To test this combination, this thesis proposes a text extraction method based on block projection. The document image is divided into equal blocks, and the blocks are filtered by combining the classification information of the connected components with the corner point density information. Next, the projections of the filtered blocks are analyzed to obtain the approximate edge position of the text region. Starting from this approximate position, a text-region edge search strategy locates the approximate edge of the text region. Lastly, to correct the irregular edges caused by touching characters and similar effects, a correction strategy adjusts the coordinates of the extracted edge points. Experiments on a dataset of historical Tibetan documents show that this method can effectively extract a complete and regular text region.

Finally, this thesis constructs a convolutional denoising autoencoder framework and a layout analysis method for historical Tibetan documents. In the first step, superpixel clustering is performed on the original image, so that similar local pixels are grouped into superpixel blocks. The features of each superpixel block are then extracted by the convolutional denoising autoencoder, and an SVM classifier is used to classify the superpixel blocks. Experiments show that this method can effectively classify the superpixel blocks belonging to different layout elements in historical Tibetan document images.
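The block filtering and projection step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the per-block corner counts have already been computed by some corner detector, it omits the connected component classification, the edge search and the coordinate correction, and the function names and the thresholds `min_corners` and `min_votes` are hypothetical choices made for illustration only.

```python
from typing import List, Optional, Tuple


def filter_blocks(corner_counts: List[List[int]], min_corners: int) -> List[List[int]]:
    """Mark a block with 1 if its corner count reaches min_corners, else 0."""
    return [[1 if c >= min_corners else 0 for c in row] for row in corner_counts]


def projection_extent(profile: List[int], min_votes: int) -> Optional[Tuple[int, int]]:
    """Return the first and last index whose projection value reaches min_votes."""
    hits = [i for i, v in enumerate(profile) if v >= min_votes]
    return (hits[0], hits[-1]) if hits else None


def approximate_text_region(
    corner_counts: List[List[int]],
    min_corners: int = 5,  # assumed density threshold, not from the thesis
    min_votes: int = 2,    # assumed minimum number of kept blocks per row/column
) -> Optional[Tuple[int, int, int, int]]:
    """Estimate (top, bottom, left, right) block indices of the text region."""
    mask = filter_blocks(corner_counts, min_corners)
    # Horizontal projection: number of kept blocks in each block row.
    row_profile = [sum(row) for row in mask]
    # Vertical projection: number of kept blocks in each block column.
    col_profile = [sum(col) for col in zip(*mask)]
    rows = projection_extent(row_profile, min_votes)
    cols = projection_extent(col_profile, min_votes)
    if rows is None or cols is None:
        return None
    return rows[0], rows[1], cols[0], cols[1]
```

The result is expressed in block indices, so multiplying by the block size would give an approximate pixel position. Requiring several kept blocks per row or column is one simple way to ignore an isolated dense block outside the main text area; the thesis may handle such cases differently.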
Keywords/Search Tags: historical Tibetan documents, text extraction, layout analysis, block projection, autoencoder