Large-scale Oracle Bone Inscriptions Dataset Construction And Algorithm Research

Posted on: 2020-08-04   Degree: Master   Type: Thesis
Country: China   Candidate: H H Wang   Full Text: PDF
GTID: 2415330575992710   Subject: Computer application technology
Abstract/Summary:
Oracle bone inscriptions are an ancient script that originated in China. Together with the hieroglyphs of ancient Egypt, the Harappan inscriptions of ancient India, and the cuneiform of Babylon, they are regarded as symbols of the four ancient civilizations. Oracle bone inscriptions have been studied since their discovery in 1899, and the work relies heavily on specialists in ancient Chinese characters. Many problems in the field remain open, such as the rejoining of oracle bone fragments, handwriting identification, the interpretation of undeciphered characters, and semantic analysis. More than 150,000 oracle bone pieces have been unearthed so far. Because the bones are fragile, they are easily damaged when moved, so few people can examine the originals directly, and rubbing images are therefore an especially important research material. With the development of computer technology, the field has gained font libraries, input methods, and a large body of transcribed copies.

The aim of this thesis is to construct a large-scale benchmark data set of oracle bone inscriptions and to carry out comprehensive experiments on it to verify the performance of existing algorithms. Existing oracle bone recognition algorithms take a small number of character crops from oracle bone inscription databases, each corresponding to part of the script on a bone. These crops have simple backgrounds, few categories, and little noise, so algorithms developed on such data cannot be applied directly to real rubbings. Real rubbings contain many character types: only 1,500-2,000 characters have been deciphered, while the other 3,000 remain undeciphered. The class distribution is extremely uneven, and some characters occur very rarely. In addition, cracks caused by burning during divination, damage during preservation, and the uneven quality of rubbing techniques introduce serious noise into the rubbings.

Based on the above, the research content of this thesis falls into three parts.

First, a data set of oracle bone inscriptions with character-level labels is constructed from rubbing images. The data set can be used for both text detection and text recognition. Baseline experiments show that a deep learning algorithm achieves high detection accuracy with 6,000 training samples, but its recognition accuracy is relatively low.

Second, building on the benchmark data set and the baseline experiments, two minor improvements are proposed. (1) The basis pursuit denoising algorithm is improved. We find sparse representation to be an effective approach to oracle script recognition, and on this basis we adjust how the support set of the basis pursuit denoising algorithm is updated in order to improve accuracy on small-sample data sets. The adapted algorithm is suited to text recognition with strong noise and many unbalanced classes, and it achieves high accuracy with a small number of features. Experimental results show that it outperforms both the deep learning algorithm and the original basis pursuit denoising algorithm, and that it balances accuracy, running time, and resource utilization. (2) A batch strategy is applied to the active-set algorithm for non-negative quadratic programming. Sparse representation-based methods occupy so much memory that they overflow when the sample size is too large. To address this, the thesis applies batch processing to the non-negative quadratic programming active-set algorithm, builds multiple classifiers, and takes the result of the best classifier as the final prediction accuracy. Experiments show that the batched method outperforms the original algorithm on a few data sets and, more importantly, mitigates the memory overflow caused by very large sample sizes.

Third, benchmark experiments are carried out on 33 algorithms, covering non-sparse-representation, deep learning, and sparse representation methods. Of these, 13 algorithms are then selected for large-scale experiments on 15 data sets. We report and compare their accuracy, running time, and resource utilization in order to find the algorithm best suited to oracle bone data sets.
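
The batched non-negative quadratic programming idea described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the solver here is plain projected gradient descent used as a stand-in for an active-set method, the rule for combining batches (keep the label with the smallest reconstruction residual) is an assumption, and the names `ATOMS`, `LABELS`, `QUERY`, and all function names are hypothetical toy values chosen for illustration.

```python
"""Batched sparse-representation classification (illustrative sketch only)."""


def nnls_projected_gradient(atoms, y, steps=500):
    """Approximately minimise ||y - sum_j x_j * atoms[j]||^2 subject to x >= 0.

    Projected gradient descent is a simple stand-in here; it is NOT the
    active-set solver named in the abstract.
    """
    n = len(atoms)
    # Gram matrix G = D^T D and correlation vector c = D^T y.
    gram = [[sum(a * b for a, b in zip(atoms[i], atoms[j])) for j in range(n)]
            for i in range(n)]
    c = [sum(a * b for a, b in zip(atoms[i], y)) for i in range(n)]
    # trace(G) bounds the largest eigenvalue of G, so 1/trace(G) is a safe step.
    step = 1.0 / max(sum(gram[i][i] for i in range(n)), 1e-12)
    x = [0.0] * n
    for _ in range(steps):
        grad = [sum(gram[i][j] * x[j] for j in range(n)) - c[i] for i in range(n)]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, xi - step * gi) for xi, gi in zip(x, grad)]
    return x


def class_residuals(atoms, labels, y):
    """Reconstruct y from each class's atoms alone and return the residual norms."""
    x = nnls_projected_gradient(atoms, y)
    residuals = {}
    for label in set(labels):
        recon = [0.0] * len(y)
        for coef, atom, lab in zip(x, atoms, labels):
            if lab == label:
                recon = [r + coef * a for r, a in zip(recon, atom)]
        residuals[label] = sum((yi - ri) ** 2 for yi, ri in zip(y, recon)) ** 0.5
    return residuals


def classify_in_batches(atoms, labels, y, batch_size):
    """Solve one small problem per batch so the full dictionary is never
    factorised at once; keep the label with the smallest residual overall
    (an assumed combination rule)."""
    best_label, best_residual = None, float("inf")
    for start in range(0, len(atoms), batch_size):
        res = class_residuals(atoms[start:start + batch_size],
                              labels[start:start + batch_size], y)
        label = min(res, key=res.get)
        if res[label] < best_residual:
            best_label, best_residual = label, res[label]
    return best_label


# Toy dictionary: each atom is a 3-dimensional feature vector (hypothetical data).
ATOMS = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
         [0.0, 1.0, 0.0], [0.0, 0.9, 0.1],
         [0.0, 0.0, 1.0]]
LABELS = ["A", "A", "B", "B", "C"]
QUERY = [0.95, 0.05, 0.0]

if __name__ == "__main__":
    print(classify_in_batches(ATOMS, LABELS, QUERY, batch_size=2))
```

Each batch builds only a `batch_size` by `batch_size` Gram matrix, which is the sense in which batching trades one large problem for several small ones. How the thesis actually selects the "optimal classifier" among batches is not stated in the abstract, so the minimum-residual rule above should be read as one plausible choice.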
Keywords/Search Tags: Oracle rubbing recognition, Text recognition, Sparse representation, Active set, Batch processing