Ancient Chinese medicine has a long and distinguished history and has left behind many precious books. Digitizing paper-based ancient books of Traditional Chinese Medicine (TCM) with Optical Character Recognition (OCR) technology will greatly help the study of the theories they contain. Current text recognition technology still faces numerous challenges when applied to ancient TCM books. Processing text images from these books requires noise reduction, seal removal, and rectification of image distortion. Extracting their content further requires dealing with the abundance of rare Chinese characters, the scarcity of annotated text datasets, and missed text detections. To address these problems, this article presents the following work on the digitization of ancient TCM books.
(1) A denoising approach for ancient book images based on a generative adversarial network (GAN). Background noise and seals in some text images from ancient TCM books can significantly impede subsequent text extraction. To mitigate this, we propose a two-stage denoising GAN framework that removes unwanted background noise and seals, improving image quality for text extraction and analysis.
(2) An algorithm for rectifying distorted ancient book images that combines the Mask R-CNN network with Bezier curves. Distortion of the text in ancient book images poses significant challenges for subsequent text recognition. Our algorithm uses Mask R-CNN to detect edge points accurately in distorted text images, and then uses the extracted edge points to correct the distortion, improving legibility and text recognition performance.
(3) An optimized text detection approach for ancient TCM books based on the DBNet network. Conventional text detection algorithms often produce missed detections and duplicate detections when applied to ancient book text. We augment DBNet with dilated convolutions and a channel attention mechanism to obtain a detection algorithm designed for ancient books. This enhancement mitigates both missed and duplicate detections, improving the accuracy and reliability of text detection in ancient TCM works.
(4) Ancient TCM texts contain many rare and unconventional Chinese characters, and the scarcity of annotated datasets suitable for training text recognition models compounds the difficulty. To address these limitations, we construct a Chinese character database of 48,709 distinct characters and develop a dedicated text annotation system that supports rare Chinese characters. Because the authentic annotations need to be supplemented with additional data, we also propose a method for synthesizing a dataset specific to ancient TCM texts by segmenting individual characters.
(5) Once the text of ancient Chinese medical books has been extracted, TCM named entities must be extracted and classified. This step supports the systematic organization and summarization of TCM knowledge.
(6) An electronic system for digitizing ancient TCM books has been designed and developed. It comprises four modules: text image preprocessing, text recognition, TCM named entity extraction, and user management. Testing and analysis confirm that the system meets the specified requirements and provides the intended functionality, which supports its suitability for digitizing ancient TCM books.
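
To make the Bezier-curve step in contribution (2) more concrete, the following is a minimal sketch of how a curved page or text-line edge could be represented and sampled once edge points are available. It is an illustration under stated assumptions, not the method used in this work: the abstract does not specify the curve degree or how control points are derived, so the cubic degree, the control point values, and the names `bezier_point` and `sample_curve` are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bezier_point(control: List[Point], t: float) -> Point:
    """Evaluate a Bezier curve of any degree at t in [0, 1] (de Casteljau)."""
    pts = list(control)
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(pts) > 1:
        pts = [
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(pts, pts[1:])
        ]
    return pts[0]


def sample_curve(control: List[Point], n: int) -> List[Point]:
    """Return n samples (n >= 2) at evenly spaced parameter values."""
    return [bezier_point(control, i / (n - 1)) for i in range(n)]


# Hypothetical control points for the upper edge of a warped page image.
# In a real pipeline these would be derived from detected edge points.
top_edge = [(0.0, 0.0), (100.0, 30.0), (200.0, 30.0), (300.0, 0.0)]
samples = sample_curve(top_edge, 5)
```

Sampling matching points on an upper and a lower edge curve gives pairs of source coordinates that a rectification step could map onto straight horizontal lines; that remapping is outside the scope of this sketch.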