Study On Key Techniques For Informatization And Digitalization Of Tangut Character

Posted on: 2020-12-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y F Meng
GTID: 1365330578452356
Subject: Traffic Information Engineering & Control
Abstract/Summary:
The Tangut Empire, also known as the Western Xia Dynasty, existed from 1038 to 1227 in what is now northwestern China, covering Ningxia, Gansu, eastern Qinghai, and northern Shaanxi. Its early capital was established in Ningxia. The Tangut script was a logographic writing system used for the now-extinct Tangut language. Promulgated as the official script of the Tangut state, it was in common use within the state for slightly more than 200 years. When the Mongols destroyed Tangut power in 1227, most of the state's written records and architecture were destroyed along with the political entity. As a result, its founders and history remained obscure, and the Tangut script fell out of use. Only at the beginning of the 20th century, in 1909, did Pyotr Kozlov excavate a large number of extant Tangut texts at Khara-Khoto. The discovery inaugurated a new era of research into Tangut history and civilization, and the study of the Tangut script, which had been forgotten for hundreds of years, plays an important role in that research.

With the development of information and artificial intelligence technology, the informatization and digitalization of Tangut characters has become a research topic of practical value: it can give Tangut researchers useful tools and make deciphering the characters in ancient Tangut texts more efficient. In this thesis, techniques from image processing, pattern recognition, and deep learning are studied to solve several key problems in the informatization and digitalization of Tangut characters. The main contents are as follows.

1) A revised Hough transform for detecting character strokes. The Hough transform is proposed for detecting the strokes of a character. It is one of the most widely used procedures in morphological image processing, but character recognition requires a revised version. This thesis proposes the Hough Transform with Guidance of Endpoints (HTGE) to suit character recognition. HTGE makes three improvements: (1) it uses endpoints to reduce the computational burden of the Hough transform; (2) it detects strokes with tolerance for lines that are not strictly straight; (3) it takes line length into account when judging straight lines, so that short lines are not neglected. Experimental results are presented to verify these improvements.

2) Design and implementation of a Tangut character sample database. Optical character recognition, machine learning, and computer vision can greatly help in deciphering the characters, but all of these techniques depend on a character database that provides training samples and a test standard. Little research, however, has been devoted to building a Tangut character database. This thesis introduces the design and implementation of Tangut character databases. Ancient Tangut texts are used as the data source from which text and single-character samples are extracted. The organization and retrieval methods of the database are discussed, and proposals are given for using it in three research scenarios: character recognition, text segmentation, and ancient text retrieval.

3) The problem of imbalanced sample distribution in the Tangut character databases. While building the databases from ancient Tangut texts, it was found that imbalanced class distribution significantly hinders the performance of learning algorithms. A method of synthetic sample generation is proposed to improve the learning and recognition of Tangut characters. Recognition accuracy is compared between models trained on the original dataset and models trained on the dataset extended with synthetic samples, and the extended dataset shows a clear advantage.

4) Tangut character recognition based on deep learning. Using the Tangut character database described in this thesis as the training and testing dataset, Tangut character recognition with deep learning techniques is studied. Several models based on various deep learning architectures were designed, trained, and tested. The performance of the recognition models is reported, and synthetic sample generation for the dataset is again shown to be effective.
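
The three HTGE improvements named in the stroke-detection item can be illustrated with a small sketch. This is not the thesis's algorithm, which the abstract does not specify. It is a simplified endpoint-pair stand-in, not a full Hough accumulator: candidate lines are restricted to segments joining two stroke endpoints, stroke pixels within a tolerance count as support, and support is measured as a fraction of the segment length so that short strokes are judged on equal terms. The function names, the 8-neighbour endpoint rule, and the `tol` and `min_support` values are all assumptions made for illustration.

```python
import math
from itertools import combinations


def find_endpoints(pixels):
    """Return stroke pixels that have exactly one 8-connected neighbour.

    Assumption: strokes are already thinned to one-pixel width.
    """
    pts = set(pixels)
    ends = []
    for (x, y) in sorted(pts):
        neighbours = sum(
            (x + dx, y + dy) in pts
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        )
        if neighbours == 1:
            ends.append((x, y))
    return ends


def segment_support(pixels, a, b, tol=1.0):
    """Fraction of sample points on segment a-b lying within tol of a stroke pixel."""
    pts = set(pixels)
    steps = max(int(round(math.dist(a, b))), 1)
    reach = int(math.ceil(tol))
    hits = 0
    for i in range(steps + 1):
        t = i / steps
        sx = a[0] + t * (b[0] - a[0])
        sy = a[1] + t * (b[1] - a[1])
        cx, cy = round(sx), round(sy)
        # Tolerance: accept any stroke pixel near the ideal line position.
        if any(
            (px, py) in pts and math.hypot(px - sx, py - sy) <= tol
            for px in range(cx - reach, cx + reach + 1)
            for py in range(cy - reach, cy + reach + 1)
        ):
            hits += 1
    return hits / (steps + 1)


def detect_strokes(pixels, tol=1.0, min_support=0.9):
    """Keep endpoint pairs whose joining segment is well covered by stroke pixels.

    Only endpoint pairs are tested (less work than a full parameter sweep),
    and the threshold is relative to segment length, so short strokes survive.
    """
    ends = find_endpoints(pixels)
    return [
        (a, b)
        for a, b in combinations(ends, 2)
        if segment_support(pixels, a, b, tol) >= min_support
    ]


if __name__ == "__main__":
    # A 10-pixel horizontal stroke and a separate 4-pixel vertical stroke.
    demo = [(x, 0) for x in range(10)] + [(20, y) for y in range(4)]
    print(detect_strokes(demo))
```

On the demo input, the 4-pixel vertical stroke is kept alongside the 10-pixel horizontal one, because support is a ratio and not an absolute vote count. Segments that bridge the gap between the two strokes fall below the threshold and are rejected.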
Keywords/Search Tags: Tangut script, sample dataset, sample extension, OCR, deep learning