
Design And Construction Of The Uchen Script Tibetan Transliteration Of The Sanskrit Ancient Character Sample Database

Posted on: 2019-10-19
Degree: Master
Type: Thesis
Country: China
Candidate: F M Zhou
Full Text: PDF
GTID: 2428330548964047
Subject: Software engineering
Abstract/Summary:
Research in artificial intelligence now depends heavily on data, and data acquisition has become a key step in most studies. Recognition research likewise cannot proceed without a sample database. To study document analysis and recognition for woodcut editions of the Tibetan transliteration of Sanskrit in Uchen script, a character sample library for this script is indispensable. According to statistics, the character set for the Tibetan transliteration of Sanskrit comprises 7240 characters (the basic set, extension set A, and extension set B). Each sample set therefore contains 7240 characters, and the research needs 5000 sets, for a total of 36,200,000 character samples. Collecting these by hand would require a large amount of manpower and material resources, would be costly, and could not be completed in a short time. Therefore, following the compositional principles of Tibetan character structure, this thesis generates characters by superimposing character components and uses the results to construct the sample library. The work covers the following aspects:

(1) Component sample acquisition. The "Uchen Script parts of Tibetan ancient sample system" was developed on the PC. Following the component table, it collects the 170 components from which all 7240 character samples can be stacked. The components are then preprocessed, including grayscale conversion, binarization, smoothing, and denoising.

(2) Synthesis algorithm for character samples. The "Uchen Script Tibetan word synthesis system" was developed on the PC platform. Based on the structure of each character, it reads position information from the "components position information database", maps each component of the character to its corresponding position, and synthesizes the sample.

(3) Database construction. From the generated BMP character samples, we constructed the Uchen Script Tibetan transliteration of the Sanskrit ancient BMP image sample database and GNT database. The BMP image samples are stored grouped by identical character. The GNT samples are stored in three ways: by single character, by identical character, and by character set.

In this thesis, 300 sets of component samples were collected with the collection system; a small number of missing components will be supplemented in later work. Finally, 5000 sets of Tibetan character samples were synthesized using two methods: drawing all components from a single set of component samples, and drawing components at random from the collected sets. The resulting sample database is stored in BMP and GNT formats. Because the component samples are incomplete, some low-frequency Tibetan character samples are missing, so each dataset contains fewer than 7240 characters. To address this, the thesis completes the sample database incrementally: it first constructs the basic-set sample database and then gradually expands it until each dataset reaches 7240 characters.
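The component-superposition idea in aspect (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the names `superimpose`, `synthesize`, and `layouts`, the toy 3x3 bitmaps, and the layout of the position table are all assumptions made for illustration. It only shows how binarized components might be pasted onto a blank canvas at stored offsets, and it checks the sample-count arithmetic from the abstract.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]  # binarized image: 1 = ink, 0 = background


def superimpose(canvas: Bitmap, component: Bitmap, top: int, left: int) -> None:
    """Copy the ink pixels of `component` onto `canvas` at (top, left), clipping at the edges."""
    for r, row in enumerate(component):
        for c, pixel in enumerate(row):
            y, x = top + r, left + c
            if pixel and 0 <= y < len(canvas) and 0 <= x < len(canvas[0]):
                canvas[y][x] = 1


def synthesize(
    char_id: str,
    components: Dict[str, Bitmap],
    layouts: Dict[str, List[Tuple[str, int, int]]],
    size: Tuple[int, int],
) -> Bitmap:
    """Build one character sample by stacking its components at their stored positions."""
    height, width = size
    canvas = [[0] * width for _ in range(height)]
    for name, top, left in layouts[char_id]:
        superimpose(canvas, components[name], top, left)
    return canvas


# Hypothetical toy data standing in for collected component samples
# and for entries of a component position table.
components = {
    "head": [[1, 1, 1]],  # a horizontal stroke
    "stem": [[1], [1]],   # a vertical stroke
}
layouts = {"toy_char": [("head", 0, 0), ("stem", 1, 1)]}

glyph = synthesize("toy_char", components, layouts, size=(3, 3))
for row in glyph:
    print("".join("#" if p else "." for p in row))

# Sample-count arithmetic from the abstract: 7240 characters per set, 5000 sets.
total_samples = 7240 * 5000
print(total_samples)
```

Keeping the position table separate from the component bitmaps means a different set of collected components can be substituted without changing the layout data, which is one plausible way to produce many sample sets from a limited number of component collections.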
Keywords/Search Tags:Uchen Script, Tibetan transliteration of the Sanskrit historical, collect components sample, superimposed, construct sample library