
Research And Implementation Of Data Extraction Oriented To Knowledge Graph

Posted on: 2022-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: J C Zhu
GTID: 2518306524993419
Subject: Master of Engineering
Abstract/Summary:
With the development of artificial intelligence, knowledge graphs have received extensive attention. Structured data can be used directly to construct knowledge graphs, but it is limited in volume and slow to update, so the capabilities of knowledge graphs are not fully exploited. Unstructured text on the Internet, by contrast, grows explosively every day and offers comprehensive coverage. Extracting structured triples from this text for use in knowledge graphs is therefore of great value. Traditional text extraction methods generally adopt a pipeline mode that splits triple extraction into two independent subtasks: entity extraction and relation extraction. This approach suffers from error propagation and information redundancy, and it ignores the fact that the two subtasks are related to each other. To solve these problems, this thesis studies methods for the joint extraction of entities and relations. The main work is as follows:

1. To address the information redundancy problem described above, a joint extraction model based on a labeling strategy is constructed, which converts the extraction of entities and relations into a joint sequence labeling task. A self-attention mechanism lets the model better capture long-distance semantic relationships within a sentence, and a bias weight reduces the influence of other, uninformative tags, improving extraction performance. The model reaches an F1 score of 51.8% on the NYT dataset, which demonstrates its effectiveness.

2. Current joint extraction models usually perform poorly on sentences that contain overlapping triples. To address this problem, a joint extraction model based on a decomposition and labeling strategy is constructed. The triple extraction task is divided into two subtasks that are labeled separately: head entity extraction, and relation and tail entity extraction. The model first identifies all possible head entities in the sentence and then, for each head entity, finds the possible tail entities under each relation. A result error correction module filters the extraction results, which improves the accuracy of the model. On the WebNLG dataset, the F1 score of this model is 1.2% higher than that of the other models compared, achieving the best extraction performance.

3. To collect enough text data from the Internet for information extraction, a distributed data collection framework is designed and implemented. The framework adopts a one-master, multiple-slave mode to collect target data from the Internet in parallel, which greatly improves collection efficiency. The framework is used to collect news data in the entertainment field, providing corpus support for subsequent information extraction in this field.

4. The extraction model proposed in this thesis is applied to the news data collected from the Internet, converting entertainment news text into triple data that provides data support for the subsequent construction of a knowledge graph in the entertainment field.
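The decomposition strategy in item 2 (head entities first, then relation-specific tail entities for each head) can be illustrated with a minimal sketch. This is not the thesis's model: the function name `assemble_triples`, the span format, the relation names, and the example sentence are all assumptions made for illustration, and the two prediction steps are replaced by hand-written inputs. It only shows how the outputs of the two subtasks combine into triples, including overlapping triples that share the same entity pair.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]          # inclusive token indices (start, end)
Triple = Tuple[str, str, str]   # (head entity, relation, tail entity)


def assemble_triples(
    tokens: List[str],
    head_spans: List[Span],
    tails_by_head: Dict[Span, Dict[str, List[Span]]],
) -> List[Triple]:
    """Combine step-1 head entities with step-2 (relation, tail) predictions.

    Hypothetical helper: in a real system both inputs would come from
    trained taggers; here they are supplied by hand.
    """
    def text(span: Span) -> str:
        start, end = span
        return " ".join(tokens[start:end + 1])

    triples: List[Triple] = []
    for head in head_spans:
        # A head entity may take part in several relations, and the same
        # tail may appear under more than one relation (overlapping triples).
        for relation, tail_spans in tails_by_head.get(head, {}).items():
            for tail in tail_spans:
                triples.append((text(head), relation, text(tail)))
    return triples


# Illustrative sentence; token indices: Jackie(0) Chan(1) ... Police(6) Story(7)
tokens = "Jackie Chan starred in and directed Police Story".split()
head_spans = [(0, 1)]  # assumed output of head entity extraction
tails_by_head = {      # assumed output of relation and tail entity extraction
    (0, 1): {"starred_in": [(6, 7)], "directed": [(6, 7)]},
}
triples = assemble_triples(tokens, head_spans, tails_by_head)
for triple in triples:
    print(triple)
```

Because tail entities are looked up per head entity and per relation, two triples with the same head and tail but different relations are both kept, which is the case a single tag per token cannot represent.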
Keywords/Search Tags: information extraction, joint extraction, information redundancy, triple overlap, data collection