
Information Extraction System For Three Types Of Information Disclosure Announcements Of Listed Companies

Posted on: 2020-10-06
Degree: Master
Type: Thesis
Country: China
Candidate: B B Wang
GTID: 2428330590971648
Subject: Electronic and communication engineering
Abstract/Summary:
With the development of Internet-based finance, listed companies publish a large number of announcements through information disclosure websites every day. The information in these announcements plays a vital role in investment analysis, enterprise interests, market impact and the allocation of economic resources. Information disclosure announcements are unstructured text: the relevant information is scattered and surrounded by a great deal of redundant content. Traditional information extraction systems have many limitations, and it is difficult for them to extract the key information of an announcement quickly, efficiently and accurately. This thesis designs an information extraction system, based on document structure and a deep learning model, for three types of information disclosure announcements of listed companies. The specific research contents are as follows:

1. A document structure tree algorithm is defined to accurately restore the hierarchical structure of the announcement text. On the basis of this tree, several classes of information extraction methods are designed according to the scope of the information. Node content extraction accurately locates the key nodes (chapters) and extracts their content. Information sentence extraction, based on expanding the triggers in a sentence, accurately extracts structured information from the content of a node. Table information extraction accurately locates the required table and extracts its structured content. The experimental results show that the F1 value of extraction from information sentences and tables exceeds 93%, and the F1 value of extraction from structured fields of tables exceeds 97%.

2. Treating the extraction of structured fields from sentences as a sequence labeling problem, a deep learning model is constructed for automatic recognition of fields. Firstly, a financial domain knowledge dictionary is constructed and used to ensure accurate word segmentation across the various announcement sentences. Secondly, Word2vec is used to pre-train domain-specific word vectors on a large corpus, mapping the input words into low-dimensional real-valued vectors. Finally, a deep learning model based on Bidirectional Long Short-Term Memory (Bi-LSTM) networks is constructed, and Conditional Random Fields (CRF) are introduced to strengthen the dependency constraints between labels, so that contextual information is fused for automatic recognition of structured information. For model training, a semi-automatic method of corpus annotation and revision is used to construct the training corpus. The final experimental results show that the average F1 value of field extraction exceeds 92%.

3. Integrating these methods and models, an information extraction system for three types of listed-company announcements is designed and implemented according to practical application requirements. The system consists of four modules: an announcement acquisition module, a document structure tree generation module, an information extraction module and a display-storage module. Finally, the function of the whole system is tested, and the test results show that the information extraction results of each part are accurate.
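The document structure tree and node-content extraction described in item 1 can be illustrated with a minimal Python sketch. This is not the thesis's algorithm, which the abstract does not specify. It assumes, purely for illustration, that headings carry dotted numeric prefixes such as "1" or "1.1" and that every other line is body text. The names `Node`, `build_tree` and `find_node` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: a heading looks like "1 Title" or "1.1 Title"; the number of
# dotted components gives its depth. Real announcements need richer rules.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Node:
    level: int
    title: str
    content: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def build_tree(lines: List[str]) -> Node:
    """Restore a heading hierarchy from flat text lines using a stack."""
    root = Node(level=0, title="ROOT")
    stack = [root]  # path from the root to the node currently being filled
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            level = match.group(1).count(".") + 1
            # Close every open node at the same depth or deeper.
            while stack[-1].level >= level:
                stack.pop()
            node = Node(level=level, title=match.group(2))
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Body text belongs to the innermost open node.
            stack[-1].content.append(line)
    return root


def find_node(node: Node, keyword: str) -> Optional[Node]:
    """Depth-first search for the first node whose title contains keyword."""
    if keyword in node.title:
        return node
    for child in node.children:
        hit = find_node(child, keyword)
        if hit is not None:
            return hit
    return None


sample = [
    "Announcement of Share Pledge",
    "1 Basic information",
    "The shareholder pledged shares.",
    "1.1 Pledge details",
    "Pledged amount: 1,000,000 shares.",
    "2 Risk warning",
    "No margin call risk.",
]
tree = build_tree(sample)
details = find_node(tree, "Pledge details")
```

Once the tree exists, locating a key chapter reduces to a title search, and later steps (trigger-based sentence extraction or table extraction) can work on `details.content` instead of scanning the whole announcement.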
Keywords/Search Tags:information disclosure announcement, information extraction, document structure tree, deep learning, word vector