
Chinese Character Recognition Of Ancient Books Based On Lightweight Convolutional Neural Network

Posted on: 2023-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Liang
GTID: 2555307118499424
Subject: Software engineering
Abstract/Summary:
Digitization of ancient books is becoming an important tool for researchers of ancient Chinese, and Chinese character recognition is its core technology. However, most Chinese character recognition methods demand substantial computation and storage, which makes them difficult to deploy on devices with limited computing performance and storage. The problem is more pronounced in recognition tasks that cover a massive number of character categories. Because commonly used characters occur far more often than uncommon ones, Chinese character image datasets collected from ancient book images are severely imbalanced and show a pronounced long-tailed distribution. Current recognition methods struggle to recognize the uncommon characters in the tail classes. In addition, manually labeling images of uncommon characters is difficult and time-consuming, so ancient Chinese character datasets contain many unlabeled images, which makes the model hard to train.

This thesis studies these problems systematically. First, it designs a lightweight network for Chinese character recognition based on several model-lightweighting techniques. Next, it constructs a lightweight ensemble model for long-tailed data. Finally, it designs an iterative semi-supervised learning method for ancient Chinese character datasets that contain many unlabeled images. By learning from unlabeled images, the ensemble model further improves recognition accuracy. The main research contents of the thesis are as follows:

1) To deploy Chinese character recognition models on devices with limited hardware resources, this thesis designs a lightweight network that reduces the number of model parameters as far as possible while preserving accuracy. Based on the MobileNetV1 model, it proposes a lightweight CNN (Convolutional Neural Network) for Chinese character images. A directly connected depthwise separable convolution structure is constructed by combining the coordinate attention module with a nonlinear activation function. The parameters of the connection layer are quantized to reduce the model size. On a large Chinese dictionary dataset containing 55,360 Chinese characters, the lightweight CNN achieves an accuracy of more than 98% with a model size of 3.9MB. On the ICDAR2013 handwritten Chinese character test set, its accuracy is similar to that of mainstream methods while requiring a model size of only 0.7MB.

2) In ancient Chinese character datasets, rare characters have far fewer samples than commonly used characters, which makes them difficult for a model to learn. To address this, the thesis studies a lightweight long-tail learning model for imbalanced datasets. The lightweight network designed in this thesis is divided into two parts: the head network serves as a shared feature extractor, and the tail network serves as an expert model. The outputs of multiple expert models are averaged, and the number of expert models used is adjusted dynamically. Experimental results on the ancient Chinese character datasets MTHV2 and TKH show that the lightweight ensemble model achieves up to 5 times higher accuracy on tail-class samples than the single lightweight model, with similar computational efficiency.

3) The thesis studies an iterative semi-supervised learning method that lets the model train on unlabeled ancient Chinese character samples. A semi-supervised method based on consistency regularization makes the lightweight ensemble model learn from labeled and unlabeled samples simultaneously. With a distribution alignment strategy, the trained model produces a higher-quality set of pseudo-labeled samples. A resampling strategy then adds part of the pseudo-labeled samples to the labeled sample set. This process is repeated to train the model iteratively. Experimental results on the MTHV2 and TKH datasets, which contain many unlabeled samples, show that iterative semi-supervised learning improves the model’s accuracy by up to 10.4%.

This thesis targets two problems: the large parameter count of recognition models that cover a massive number of Chinese characters, and the shortcomings of conventional CNNs when learning ancient Chinese characters. The experimental results show that the lightweight model keeps the parameter count low. At the same time, it significantly improves recognition accuracy for rare characters with few samples and makes better use of unlabeled samples, which gives it practical value. In addition, based on the lightweight CNN, the thesis builds a Chinese character recognition system as a WeChat mini program.
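
As a rough illustration of why a network built on MobileNetV1-style depthwise separable convolutions keeps the parameter count low, the sketch below compares the weights of a standard convolution with those of a depthwise separable one. The function names and the layer sizes (3x3 kernel, 256 input and output channels) are illustrative assumptions and are not taken from the thesis. Biases are ignored, and the thesis's coordinate attention module and parameter quantization are not modeled.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise separable convolution (biases ignored).

    Depthwise step: one kernel x kernel filter per input channel.
    Pointwise step: a 1x1 convolution that mixes c_in channels into c_out.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes chosen only for illustration.
    k, c_in, c_out = 3, 256, 256
    std = standard_conv_params(k, c_in, c_out)
    sep = depthwise_separable_params(k, c_in, c_out)
    print(f"standard: {std}")              # standard: 589824
    print(f"separable: {sep}")             # separable: 67840
    print(f"reduction: {std / sep:.2f}x")  # reduction: 8.69x
```

For a 3x3 kernel the reduction approaches 9x as the number of output channels grows, since the ratio equals 1 / (1/c_out + 1/kernel²).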
Keywords/Search Tags: Chinese character recognition, Lightweight model, Ensemble model, Long-tailed learning, Semi-supervised learning