
Unstructured Information Extraction Methods For Domain-Specific Knowledge Graphs

Posted on: 2023-06-04    Degree: Master    Type: Thesis
Country: China    Candidate: J Zhang    Full Text: PDF
GTID: 2568307103485804    Subject: Computer technology
Abstract/Summary:
Domain knowledge graphs allow machines, like humans, to grasp the deeper meaning of text by storing knowledge, which helps prevent faulty reasoning and major decision errors. Constructing a domain knowledge graph, however, requires a huge amount of structured data, and manually converting unstructured text into structured data is expensive and demands strong domain expertise. It is therefore important to study how information extraction techniques can be used to obtain structured data automatically, and to that end this thesis proposes automated unstructured information extraction methods oriented toward domain knowledge graph construction.

Existing information extraction approaches fall mainly into sequence-labeling methods and span-representation methods. Sequence-labeling approaches divide into three directions: tagging, table filling, and sequence-to-sequence. They typically rely on BIO (begin, inside, outside) / BILOU (begin, inside, last, outside, unit) tagging schemes, which struggle with nested entities. Span-based approaches, in contrast, can search all spans in detail and identify nested entities efficiently. Among the most advanced span-based models, SpERT contributes a sufficient number of strong negative samples and localized context, but it still lacks explicit boundary supervision for entities and under-utilizes domain-specific information. We therefore propose an information extraction (IE) method based on attention contribution degree, and an IE method for the judicial domain. The core contributions of this thesis are as follows:

1. To sharpen the model's sensitivity to entity boundaries and its mining of domain-specific information, we introduce the attention contribution degree as a boundary confidence. Specifically, a span classifier with a multilayer perceptron-softmax structure is attached to the attention-head residuals of each layer, so that the model does not lose the original token information as depth increases. Experiments demonstrate the effectiveness of the method on news, scientific, and medical corpora; in particular, it outperforms the current state of the art on the SciERC (scientific information extraction) dataset leaderboard.

2. To address the difficulty of extracting special elements at the document-base level, we select five of the more complex evaluation indicators (provided by the Law School of Xiangtan University and used to evaluate the effectiveness of judicial reform) and, under the guidance of legal professionals, propose novel and more specialized matching rules that combine locating key segments, keyword matching, and multi-step logical reasoning.

3. For entities that matching rules cannot identify because they depend on semantic context, we propose a deep learning model based on a contextual-information computation strategy. The strategy first assembles the normalized inter-word attention scores into multi-channel "attention graphs", then trains multi-scale convolution-pooling layers to compress the "attention graphs" into multi-channel "attention points". The "attention points" serve as contextual information for downstream tasks, strengthening inter-word dependencies. Experiments show that this method outperforms the mainstream BERT-BiLSTM-CRF model.
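The boundary-confidence idea in contribution 1 can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names, the use of received-attention column means as the confidence score, and the scaling of the classifier logits by that confidence are not the thesis's actual implementation, which attaches trained MLP-softmax classifiers to the attention-head residuals of every Transformer layer.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def boundary_confidence(attn, start, end):
    """Attention contribution of a span's boundary tokens.

    `attn` is a (seq_len, seq_len) row-normalized attention map; the
    column mean measures how much attention a token receives from the
    whole sentence, which this sketch treats as boundary confidence.
    """
    received = attn.mean(axis=0)  # average attention each token receives
    return (received[start] + received[end]) / 2.0

def classify_span(span_repr, W, b, conf):
    """Single-layer perceptron + softmax span classifier whose logits
    are scaled by the boundary confidence of the endpoint tokens."""
    logits = span_repr @ W + b
    return softmax(conf * logits)

rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(6, 6)))   # toy row-normalized attention map
conf = boundary_confidence(attn, start=1, end=3)
probs = classify_span(rng.normal(size=8), rng.normal(size=(8, 4)),
                      np.zeros(4), conf)  # class distribution for the span
```

A low boundary confidence flattens the class distribution toward uniform, so spans with weakly attended endpoints are classified less assertively.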
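The rule pipeline in contribution 2 (locate a key segment, match keywords, then apply logical reasoning) might look roughly like the sketch below. The segment header, keywords, and sample document are invented placeholders, not the actual rules developed with the legal professionals.

```python
import re

def locate_segment(document, header):
    """Return the paragraph that starts with `header`, i.e. the key
    segment of the judgment document, or None if it is absent."""
    for para in document.split("\n\n"):
        if para.strip().startswith(header):
            return para
    return None

def extract_element(document):
    """Combine segment location, keyword matching and a simple logical
    check to decide a yes/no evaluation indicator (placeholder logic)."""
    seg = locate_segment(document, "Court opinion")
    if seg is None:
        return "unknown"
    mentions_mediation = re.search(r"mediat(e|ion)", seg) is not None
    mentions_refusal = "refused" in seg or "declined" in seg
    # logical reasoning step: mediation mentioned AND not refused -> positive
    return "yes" if mentions_mediation and not mentions_refusal else "no"

doc = "Case facts\n\n...\n\nCourt opinion: the parties agreed to mediation."
```

Running `extract_element(doc)` on the toy document returns `"yes"`; a document with no "Court opinion" segment returns `"unknown"` rather than a guess, which is the conservative behavior a downstream evaluation indicator needs.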
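The "attention graph" to "attention point" compression in contribution 3 can be sketched as follows. The pooling scales and the final per-channel averaging are assumptions for illustration: the thesis trains multi-scale convolution-pooling layers, which a fixed average pool only approximates.

```python
import numpy as np

def attention_graphs(scores):
    """Row-normalize raw attention scores per head, giving a
    multi-channel 'attention graph' of shape (heads, seq, seq)."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_points(graphs, scales=(1, 2)):
    """Compress each attention-graph channel at several scales into one
    scalar per (scale, head): the 'attention points'. A fixed average
    pool stands in for the trained convolution-pooling layers."""
    heads, n, _ = graphs.shape
    points = []
    for s in scales:
        m = (n // s) * s                          # crop so n divides by s
        pooled = graphs[:, :m, :m].reshape(heads, m // s, s, m // s, s)
        pooled = pooled.mean(axis=(2, 4))         # s x s average pooling
        points.append(pooled.mean(axis=(1, 2)))   # one point per channel
    return np.concatenate(points)                 # (len(scales) * heads,)

rng = np.random.default_rng(0)
graphs = attention_graphs(rng.normal(size=(4, 6, 6)))  # 4 heads, 6 tokens
pts = attention_points(graphs)                         # 2 scales x 4 heads
```

In a real model the pooled channels would feed a downstream tagger as extra contextual features; here they only demonstrate the shape of the computation. Note that with a plain average pool every point collapses to 1/seq_len (each row of an attention graph sums to 1), which is exactly why the thesis learns the convolution-pooling weights instead.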
Keywords/Search Tags: information extraction, attention score, Transformer pre-training model, domain knowledge graph