The Research Of Announcement Information Extraction

Posted on: 2021-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: J A Zhang
GTID: 2428330626960392
Subject: Computer technology
Abstract/Summary:
The volume of data generated daily in the information age exceeds what humans can process manually, which creates serious difficulties for analytical tasks such as venture capital, data trend analysis, and financial supervision. In this context, automated information extraction has become an effective means of addressing the problem. This thesis applies automated information extraction to two kinds of announcements: announcements whose tables are saved as pictures, and personnel transfer announcements.

Tables are an important form of content in announcements and are often saved as pictures, so extracting them involves image processing and recognition. This thesis designs a method for recognizing and processing picture tables in municipal government budget and final account announcements. The method identifies the outline of the table and of the cells within it, locates and crops each cell from the picture, extracts the content of each cell, and saves the row and column position of each cell. Recognition uses the popular open-source OCR tool Tesseract and the commercial recognition service Baidu Intelligent Cloud OCR. The recognition results are then combined with the row and column positions to output the information in the table.

For personnel transfer announcements, this thesis proposes a named entity recognition model based on BERT-BiLSTM-CRF to recognize person names and positions. The model builds on the traditional BiLSTM-CRF named entity recognition model and uses the BERT model in place of Word2Vec to produce pre-trained word vectors, which improves feature extraction and, in turn, the named entity recognition results. After the entities are recognized, rules based on keyword matching within the sentence establish the relationship between person names and position names, yielding complete personnel transfer information that other tasks can process further.

The picture table extraction task uses Dalian budget and final account files as experimental data. The recognition accuracies of Tesseract and Baidu Intelligent Cloud OCR are 83.67% and 99.57%, respectively. The text information extraction task uses personnel transfer announcements published on the Sina Finance website. The proposed BERT-BiLSTM-CRF model achieves F1 values of 98.7% and 91.63% for person names and job titles, respectively, which are 4.35% and 3.11% higher than those of BiLSTM-CRF. For establishing the relationship between person names and job titles and outputting it as the final result, the accuracy is 83.96%.
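The abstract does not say how the row and column position of each cell is computed once the cell outlines are located. As an illustration only, the following minimal Python sketch assumes each detected cell is available as a bounding box `(x, y, width, height)` in pixels and derives grid positions by clustering the top-left coordinates. The function names, the box format, and the `tol` pixel tolerance are all hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Tuple

# Hypothetical cell representation: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


def cluster_coords(values: List[int], tol: int) -> List[int]:
    """Collapse coordinates lying within `tol` pixels of a group's first value.

    Returns one anchor coordinate per group, in ascending order.
    """
    anchors: List[int] = []
    for v in sorted(values):
        if not anchors or v - anchors[-1] > tol:
            anchors.append(v)
    return anchors


def nearest_index(value: int, anchors: List[int]) -> int:
    """Return the index of the anchor closest to `value`."""
    return min(range(len(anchors)), key=lambda i: abs(anchors[i] - value))


def assign_grid_positions(boxes: List[Box], tol: int = 5) -> Dict[Box, Tuple[int, int]]:
    """Map each cell box to a (row, column) index.

    Rows come from clustering the top edges (y); columns from the left edges (x).
    Assumption: a regular grid without merged cells.
    """
    row_anchors = cluster_coords([y for _, y, _, _ in boxes], tol)
    col_anchors = cluster_coords([x for x, _, _, _ in boxes], tol)
    return {
        box: (nearest_index(box[1], row_anchors), nearest_index(box[0], col_anchors))
        for box in boxes
    }


if __name__ == "__main__":
    # Four cells of a 2x2 table with a few pixels of detection jitter.
    cells = [(10, 10, 100, 30), (111, 12, 80, 30), (10, 41, 100, 30), (112, 40, 80, 30)]
    positions = assign_grid_positions(cells)
    for box, (row, col) in sorted(positions.items(), key=lambda kv: kv[1]):
        print(row, col, box)
```

With positions assigned this way, the text that an OCR tool returns for each cropped cell can be written into a table keyed by `(row, column)`. Merged cells and skewed scans would need extra handling that this sketch does not attempt.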
Keywords/Search Tags:Named Entity Recognition, Form extraction, Contour recognition, Optical character recognition