Information Extraction System For Three Types Of Information Disclosure Announcements Of Listed Companies

Posted on:2020-10-06

Degree:Master

Type:Thesis

Country:China

Candidate:B B Wang

GTID:2428330590971648

Subject:Electronic and communication engineering

Abstract/Summary:

With the development of Internet-based finance, listed companies publish a large number of announcements through information disclosure websites every day. The information in these announcements plays a vital role in investment analysis, enterprise interests, market impact and the allocation of economic resources. Information disclosure announcements are unstructured text: the key information is scattered and surrounded by a large amount of redundant content. Traditional information extraction systems have many limitations and struggle to extract the key information of an announcement quickly, efficiently and accurately. This thesis designs an information extraction system, based on document structure and a deep learning model, for three types of information disclosure announcements of listed companies. The specific research contents are as follows:

1. A document structure tree algorithm is defined to accurately restore the hierarchical structure of the announcement text. On the basis of this tree, several classes of extraction methods are designed according to the scope of the target information. Node content extraction locates the key nodes (chapters) and extracts their content. Information sentence extraction, based on expanding the triggers found in a sentence, extracts structured information from the content of a node. Table extraction locates the required table and extracts its structured content. The experimental results show that the F1 value of extracting information sentences and tables exceeds 93%, and the F1 value of extracting structured fields from tables exceeds 97%.

2. Treating the extraction of structured fields from sentences as a sequence labeling problem, a deep learning model is constructed to recognize the fields automatically. Firstly, a financial domain knowledge dictionary is constructed to ensure accurate word segmentation across the various announcement sentences. Secondly, Word2vec is used to pre-train domain-specific word vectors on a large corpus, mapping the input words to low-dimensional real-valued vectors. Finally, a deep learning model based on Bidirectional Long Short-Term Memory (Bi-LSTM) networks is constructed, and a Conditional Random Field (CRF) is introduced to strengthen the relevance constraints between labels, so that contextual information is fused for the automatic recognition of structured information. For model training, a semi-automatic method of corpus annotation and revision is used to build the training corpus. The final experimental results show that the average F1 value of field extraction exceeds 92%.

3. Integrating these methods and models, an information extraction system for three types of announcements of listed companies is designed and implemented according to practical application requirements. The system consists of four modules: an announcement acquisition module, a document structure tree generation module, an information extraction module and a display-storage module. Finally, the functions of the whole system are tested, and the test results show that the extraction results of each part are accurate.
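
The abstract does not give the details of the document structure tree algorithm, so the following is only a minimal sketch of the general idea: restoring a chapter hierarchy from flat text and then locating a key node by its title. The heading pattern (`1`, `1.1`, `1.1.1` followed by a title), the `Node` class, and the `build_tree` and `find_node` helpers are all assumptions made for illustration, not the thesis's actual method.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical heading convention: "1 Title", "1.2 Title", "1.2.3 Title".
# Real announcements use other numbering styles; this pattern is an assumption.
# Limitation: a body line that starts with a number and a space would be
# misread as a heading.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Node:
    title: str
    level: int
    content: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def build_tree(lines: List[str]) -> Node:
    """Build a chapter tree from flat lines using a stack of open ancestors."""
    root = Node(title="ROOT", level=0)
    stack = [root]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADING_RE.match(line)
        if match:
            # Depth is the number of numeric components in the heading number.
            level = match.group(1).count(".") + 1
            node = Node(title=match.group(2), level=level)
            # Close every open node at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Body text belongs to the most recently opened node.
            stack[-1].content.append(line)
    return root


def find_node(node: Node, keyword: str) -> Optional[Node]:
    """Depth-first search for the first node whose title contains keyword."""
    if keyword in node.title:
        return node
    for child in node.children:
        hit = find_node(child, keyword)
        if hit is not None:
            return hit
    return None


# Made-up sample text, not taken from a real announcement.
sample = [
    "1 Basic Information",
    "The company held its board meeting.",
    "1.1 Shareholding Changes",
    "Shareholder A reduced its holdings.",
    "2 Risk Warning",
    "Investors should be cautious.",
]
tree = build_tree(sample)
target = find_node(tree, "Shareholding")
print(target.content)
```

Keeping a stack of open ancestors means each line is handled once, and the content of a located node can then be handed to a sentence-level or table-level extractor.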

Keywords/Search Tags:

information disclosure announcement, information extraction, document structure tree, deep learning, word vector

Related items

1	Research Of Web Information Extraction Based On Table Structure
2	The Adaptive Web Information Extraction Based On Single DOM Tree Characteristics And Classification
3	Research And Application Of Web Information Extraction And Webpage Summarization
4	Intermediate Document Xml-based Information Extraction Technology Research
5	Pattern-Based Information Extraction From HTML Documents
6	Research On The Method Of Sensitive Word Vector And Sentiment Classification Based On Deep Learning
7	Research And Implementation About Assisted Writing System Of Traffic Information Standards
8	Research On Cross-language Information Extraction Based On Deep Learning
9	Multiple Documents Automatically Summary Based On Semantic Word Vector
10	Study On Information Hiding Techniques Based On Word Text Document