Research On Classification Algorithm Of Ancient Chinese Characters Based On "Long Tail Distribution"

Posted on: 2022-11-13
Degree: Master
Type: Thesis
Country: China
Candidate: J Y He
Full Text: PDF
GTID: 2518306752954279
Subject: Master of Engineering
Abstract/Summary:
This thesis studies the automatic recognition of four kinds of ancient Chinese characters: oracle bone inscriptions, Shang and Zhou inscriptions, Spring and Autumn and Chu slips, and stone inscriptions of the Wei, Jin, and Southern and Northern Dynasties. These scripts have a long history, differ considerably from modern Chinese, and are hard to read without specialist training, so recognizing them automatically with artificial intelligence methods is of great significance.

Characters are not used equally often: some are common and some are rare. As a result, the datasets in this thesis are unbalanced. Some common characters have thousands of samples, while some rare characters have only dozens. If the original data is fed directly into a neural network for training, the model will inevitably be biased: it learns many features of the characters with many samples and recognizes the characters with few samples poorly. This thesis therefore studies how to improve recognition accuracy on categories with few samples when the ancient character datasets are unbalanced.

The main work of this thesis is as follows:

1. The training process is divided into two phases: first a balanced learning phase, and then a phase of learning on the original distribution. Experiments show that this two-phase training (TPT) algorithm brings a certain improvement.

2. The datasets in this thesis all consist of ancient Chinese characters. Although sample counts differ greatly between categories within a dataset, the characters are similar in form, so transferring features from the head classes (categories with many samples) to the tail classes (categories with few samples) is feasible. The data show that the resulting Feature Transfer algorithm Based on Head Classes (FTBHC) is better than the TPT algorithm.

3. Following the idea of FTBHC, a Feature Transfer algorithm Based on Shike Set (FTBS) is proposed. Although the four kinds of characters belong to different periods, they have an evolutionary relationship and are similar to each other. The Shike Set has the most abundant sample information, so it is the best choice as the basis for feature transfer. The data show that the FTBS algorithm is superior to FTBHC.

4. For the case where no features are transferred from other datasets, a Bilateral-Branch Network algorithm based on Self-supervised Pre-training (BBN-SSP) is proposed, which applies an effective unbalanced classification algorithm within a self-supervised learning framework. Self-supervised pre-training (SSP) works well for improving unbalanced classification algorithms such as the Bilateral-Branch Network, which is tailor-made for the "long tail" classification problem. Experimental data show that combining the two achieves the same effect as the FTBS algorithm.

Although the algorithms differ in effectiveness, they suit different scenarios, so the most appropriate algorithm needs to be selected for each scenario. Although the research in this thesis is aimed mainly at the field of ancient characters, the research ideas can be extended to other real-world classification data with a "long tail distribution".
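The two-phase training idea in item 1 (a balanced learning phase followed by learning on the original distribution) can be illustrated with a minimal sampling sketch. The abstract does not say how the balanced phase is implemented; class-balanced re-sampling is assumed here, and the function names, step counts, and toy labels are all hypothetical.

```python
import random
from collections import Counter

def balanced_weights(labels):
    """Per-sample weights under which every class has equal total probability."""
    counts = Counter(labels)
    num_classes = len(counts)
    return [1.0 / (num_classes * counts[y]) for y in labels]

def original_weights(labels):
    """Uniform per-sample weights, which reproduce the original class distribution."""
    return [1.0 / len(labels)] * len(labels)

def sample_batch(labels, weights, batch_size, rng):
    """Draw sample indices with replacement according to the given weights."""
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)

def two_phase_schedule(labels, balanced_steps, original_steps, batch_size, seed=0):
    """Yield (phase, batch_indices): balanced phase first, then original distribution."""
    rng = random.Random(seed)
    for _ in range(balanced_steps):
        yield "balanced", sample_batch(labels, balanced_weights(labels), batch_size, rng)
    for _ in range(original_steps):
        yield "original", sample_batch(labels, original_weights(labels), batch_size, rng)

# Toy long-tailed label list (made-up data): one head class and two tail classes.
labels = ["head"] * 1000 + ["tail_a"] * 20 + ["tail_b"] * 10

# Total sampling probability per class during the assumed balanced phase.
class_mass = Counter()
for y, w in zip(labels, balanced_weights(labels)):
    class_mass[y] += w
```

In this sketch each class carries the same total sampling probability in the first phase, so tail characters are seen as often as head characters, while the second phase draws samples uniformly and therefore follows the original long-tailed distribution. A real training loop would feed each batch of indices to the network instead of only yielding them.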
Keywords/Search Tags: Long tail distribution, CNN, Two-phase Training, Transfer Learning, Self-supervised Learning, Bilateral-Branch Network