In recent years, with the rapid development of image processing and artificial intelligence, the accuracy of optical character recognition (OCR) has improved, and a large number of character recognition products have emerged. Tabular documents are important tools for statistics and record keeping and are used frequently in production and daily life. Because the overall structure and text layout of tabular documents are complex, traditional text recognition products perform poorly on them: the extracted table content is disordered, and text recognition accuracy is low. Few text extraction products for tabular documents are currently on the market, and those that exist cannot meet actual production needs, so designing a text recognition system for table images is of practical significance. Building on current achievements in image processing and artificial intelligence, this article designs a table text extraction system. The main research work and innovations are as follows:

(1) A new table image preprocessing system is proposed, based on the characteristics of table images encountered in practical production and daily life. Its main tasks include removing image content that is neither text nor table lines, sharpening the text and table frames, correcting the tilt angle of the table image, and normalizing text and table lines of different colors to black and the background to white. After preprocessing, the interference introduced when the table image was captured is significantly reduced, which improves the accuracy of subsequent table content extraction and text recognition.

(2) Based on the idea of data flow and combined with artificial intelligence technology, an algorithm framework comprising table structure recognition, text detection, and text recognition is proposed. Each part implements a specific function, and together they complete the extraction and recognition of table text. Each part can be designed separately for different demand scenarios, so that its correctness can be verified independently. In addition, processing can be differentiated according to whether an extracted text position lies inside the table, so that both the table text and the non-table text in the image are retained.

(3) According to the design requirements of the algorithms, a complete set of annotation methods is designed for the training and validation datasets. Most of the data are collected from real table documents, and some have been augmented to simulate the various kinds of interference encountered when capturing table images in real scenarios. Commonly used methods for annotating and augmenting datasets, the sources of the collected data, and the metrics commonly used to evaluate algorithm accuracy are also introduced.

(4) To faithfully restore the content of the table, a post-processing system is designed to arrange the table structure and text recognition results correctly. It reconstructs the table framework from the output of the table structure recognition algorithm and places the recognized content into the corresponding table cells in order, according to the output of the text detection algorithm, restoring the true content of the recognized document as far as possible.
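
The arrangement step described in (2) and (4) can be illustrated with a minimal sketch. It is not the system's actual implementation: the `Box` type, the `arrange` function, the cell grid keyed by `(row, col)`, and the rule of assigning a text box to the cell that contains its center are all assumptions made for illustration. Text whose center falls in no cell is kept separately as non-table content.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in image coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def arrange(cells, texts):
    """Place recognized text into table cells.

    cells: dict mapping (row, col) -> Box, assumed to come from table
           structure recognition.
    texts: list of (Box, str) pairs, assumed to come from text detection
           and text recognition.
    Returns (table, outside): cell contents keyed by (row, col), and the
    list of strings that lie outside every cell.
    """
    table = {key: [] for key in cells}
    outside = []
    # Visit text boxes in reading order: top to bottom, then left to right.
    for box, text in sorted(texts, key=lambda t: (t[0].y0, t[0].x0)):
        cx, cy = box.center()
        for key, cell in cells.items():
            if cell.contains(cx, cy):
                table[key].append(text)
                break
        else:
            # No cell contains the center: treat as non-table text.
            outside.append(text)
    return {key: " ".join(parts) for key, parts in table.items()}, outside
```

Matching on the box center rather than on full containment is a design choice in this sketch: it tolerates detected text boxes that slightly overlap a table line.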