
Design And Implementation Of Key Information Extraction System For Digital Archives

Posted on: 2022-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Li
Full Text: PDF
GTID: 2518306602965509
Subject: Master of Engineering
Abstract/Summary:
In recent years, with the development of computer technology and optical character recognition (OCR), digital archives have become a popular information medium thanks to their low cost, strong manageability, and efficient resource sharing. However, while the total volume of digital archives is growing rapidly, archive digitization technology still remains at the level of simple manual scanning and processing of archive photographs. This approach is costly and inefficient, and it cannot identify the specific information an archive contains, which makes searching for and further processing key information very inconvenient. In addition, digital archives are numerous, varied in type, and inconsistent in format. When popular keyword extraction algorithms are applied to them, it is difficult to find a suitable training corpus, and the extraction results are poor. How to make an archive management system identify archives accurately and screen and filter their key information has therefore become a central issue in the further development of archive digitization.

This thesis proposes an archive type recognition and segmentation algorithm based on pattern matching, and a key information extraction algorithm for digital archives based on a topic model and TextRank. On this basis, a key information extraction system for digital archives is designed and implemented.

The type recognition and segmentation algorithm first preprocesses and denoises the archive images with OpenCV, then identifies the archive type according to the characteristics of different archives, and finally segments each type of archive at a granularity suited to that type.

The key information extraction algorithm takes the archive text as input, filters it, classifies the words, removes stop words, and selects candidate words. The filtered text is used as the source data set for training the topic model, which yields the document-topic and topic-word probability distributions over the text set. These two distributions are then used to modify the TextRank iteration formula; a word graph is constructed and iterated over, and the keywords of the archive text are extracted. For the classes clustered by the topic model, word2vec is used to compute word similarity in order to name the resulting clusters. Finally, the algorithm uses pattern matching to extract other key information, such as the title, author, and document number.

The key information extraction system adopts a three-layer MVC architecture and is divided into two modules: the archive image segmentation and recognition module and the key information extraction module. The segmentation and recognition module segments the archive images and uses the Tesseract-OCR engine to recognize their text, which is passed to the key information extraction module. After testing and deployment, the system achieves the expected results: it saves the manpower and material resources consumed in archive digitization and improves the accuracy of key information extraction from digital archives.
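The abstract does not give the modified TextRank iteration formula, so the following is only a minimal sketch of one common way to bias TextRank with topic-model output, not the thesis's actual method. It assumes the document-topic distribution P(topic | doc) and the topic-word distributions P(word | topic) are already available as plain dictionaries, and it uses them as a per-word restart prior in the iteration. All function and parameter names (`topic_prior`, `build_word_graph`, `topic_textrank`, `window`, `d`) are hypothetical.

```python
from collections import defaultdict


def topic_prior(words, doc_topic, topic_word):
    """Combine P(topic | doc) and P(word | topic) into a normalized per-word prior."""
    raw = {
        w: sum(p_t * topic_word[t].get(w, 0.0) for t, p_t in doc_topic.items())
        for w in words
    }
    total = sum(raw.values())
    if total == 0:
        # Fall back to a uniform prior when the topics cover none of the words.
        return {w: 1.0 / len(words) for w in words}
    return {w: v / total for w, v in raw.items()}


def build_word_graph(tokens, window=3):
    """Undirected co-occurrence graph: link distinct words at most window-1 tokens apart."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            u = tokens[j]
            if u != w:
                graph[w][u] += 1.0
                graph[u][w] += 1.0
    return graph


def topic_textrank(tokens, doc_topic, topic_word, d=0.85, iters=50, window=3):
    """Rank words with TextRank, using the topic prior instead of a uniform restart term."""
    graph = build_word_graph(tokens, window)
    words = sorted(set(tokens))
    prior = topic_prior(words, doc_topic, topic_word)
    score = {w: 1.0 / len(words) for w in words}
    out_weight = {w: sum(graph[w].values()) for w in words}
    for _ in range(iters):
        new_score = {}
        for w in words:
            # Each neighbor u passes on its score in proportion to the edge weight.
            incoming = sum(graph[u][w] / out_weight[u] * score[u] for u in graph[w])
            new_score[w] = (1 - d) * prior[w] + d * incoming
        score = new_score
    return sorted(words, key=score.get, reverse=True)


# Toy input: tokens after filtering and stop-word removal (illustrative data only).
tokens = ["archive", "scan", "archive", "ocr", "archive", "scan", "page"]
doc_topic = {0: 0.9, 1: 0.1}
topic_word = {0: {"archive": 0.6, "ocr": 0.3}, 1: {"scan": 0.5, "page": 0.5}}
ranking = topic_textrank(tokens, doc_topic, topic_word)
```

Replacing the uniform restart term with a topic-derived prior pulls the ranking toward words that the topic model considers characteristic of the document, while the co-occurrence graph still rewards words that are well connected in the text.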
Keywords/Search Tags: Archive Digitization, Keyword Extraction, Topic Model, TextRank, Image Recognition