Research On Information Extraction And Fusion Of Knowledge Graph For Unstructured Data

Posted on: 2022-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: J R Liao
Full Text: PDF
GTID: 2518306524984429
Subject: Master of Engineering
Abstract/Summary:
The development of "Internet Plus" has driven a significant rise in global data production. Unstructured data account for 80% of the total and have become the main component of global data. Processing unstructured data has long been both a focus and a difficulty of natural language processing research. The emergence of the knowledge graph provides a complete and feasible solution for big data processing. As basic tasks in knowledge graph construction, information extraction and information fusion have received wide attention. For unstructured data, this thesis focuses on these two tasks. For information extraction, the thesis proposes an entity information extraction model that incorporates a part-of-speech attention mechanism (POStag-Attention-LSTM-CRF, PALC) and uses part-of-speech features to improve extraction accuracy. To explore the relationship between the extraction and fusion tasks, the thesis proposes a joint learning model of entity information extraction and fusion (POStag-Attention-LSTM-CRF-Dynamic Context Augmentation, PALC-DCA). Adding a feedback module to the PALC-DCA model makes the two tasks depend on each other and improves the evaluation metrics of the tasks. The main research work of the thesis is as follows:

(1) To help the entity information extraction model better learn the semantic representation of words, the thesis proposes the PALC model, which incorporates a part-of-speech attention mechanism. The PALC model uses the Stanford Parser part-of-speech tagging tool to tag all unstructured data, and uses a recurrent neural network (RNN) to learn the part-of-speech features of the words in each sentence. This RNN provides additional features to the information extraction model. Moreover, a part-of-speech feature represents the category and attributes of a word in its sentence, which helps the PALC model obtain more accurate semantic features and thus improves the accuracy of information extraction.

(2) To solve the problem of feature fusion in the information extraction model, the thesis proposes a feature fusion method based on an attention mechanism and a multi-layer bidirectional long short-term memory (LSTM) network. The multi-layer bidirectional LSTM network is used to obtain the semantic representation of words. The attention mechanism captures the relationship between the different parts of speech and the semantic representation, yielding a weight matrix that expresses the influence of the part-of-speech features. The part-of-speech features of the different words are multiplied by this weight matrix and summed, and the result is concatenated with the other features to form the vector representation of each word. A multi-layer bidirectional LSTM network then learns the semantic features of the words again to obtain a more accurate semantic representation. The experimental results show that the accuracy, recall and F1 score of this method are 90.65%, 91.06% and 90.84% respectively.

(3) For the joint learning of information extraction and information fusion, the thesis proposes the PALC-DCA joint learning model of entity information extraction and fusion. The PALC-DCA model unifies the datasets used for information extraction and information fusion. By studying the joint use of the datasets, the thesis establishes a coding query mechanism and provides a shared dataset scheme, so that the two datasets can be used within the same framework. This dataset sharing method provides the data foundation for the joint learning of information extraction and information fusion.

(4) To better combine the information extraction task with the information fusion task, the thesis proposes adding a feedback module to the fusion model of the PALC-DCA joint learning model. The feedback module uses a convolutional neural network (CNN) to learn the description information of candidate entities in a third-party knowledge base and multiplies this description information by the local score. A feedforward network then produces the probability distribution over entity class labels for information extraction. Finally, the conditional random field (CRF) layer of the entity information extraction model is used to obtain the entity class labels. The experimental results show that adding the feedback module improves the accuracy of information extraction. On the CoNLL03 dataset, information extraction reaches an accuracy of 90.93%, a recall of 91.12% and an F1 score of 91.02%. With joint learning, information fusion on the AIDA-CoNLL dataset reaches an accuracy of 94.24%, a recall of 94.14% and an F1 score of 94.18%.
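The attention-based fusion step described in item (2) can be illustrated with a minimal sketch. This is not the thesis implementation: the dot-product scoring, the function names (`softmax`, `dot`, `fuse_pos_features`) and the toy vectors are all assumptions made for illustration, and the sketch assumes the semantic and part-of-speech vectors share the same dimension. It shows only the general pattern of scoring part-of-speech features against a word's semantic vector, taking the weighted sum, and concatenating the result with other features.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_pos_features(semantic_vectors, pos_vectors, other_features):
    """Hypothetical fusion step (an assumption, not the thesis code).

    For each word, score every part-of-speech vector in the sentence
    against the word's semantic vector, normalise the scores into
    attention weights, take the weighted sum of the part-of-speech
    vectors, and concatenate that sum with the word's other features.
    """
    dim = len(pos_vectors[0])
    fused = []
    for semantic, other in zip(semantic_vectors, other_features):
        # Attention weights: one per part-of-speech vector.
        weights = softmax([dot(semantic, pos) for pos in pos_vectors])
        # "Multiplied and added": weighted sum of part-of-speech vectors.
        context = [
            sum(w * pos[d] for w, pos in zip(weights, pos_vectors))
            for d in range(dim)
        ]
        # "Spliced": concatenate with the remaining features.
        fused.append(context + list(other))
    return fused


# Toy two-word sentence; all values are made up for illustration.
semantic_vectors = [[1.0, 0.0], [0.0, 1.0]]
pos_vectors = [[1.0, 0.0], [0.0, 1.0]]
other_features = [[0.5], [0.7]]
word_vectors = fuse_pos_features(semantic_vectors, pos_vectors, other_features)
```

In the pipeline described in the abstract, vectors like `word_vectors` would be passed to a further multi-layer bidirectional LSTM; that stage is omitted here.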
Keywords/Search Tags: knowledge graph, information extraction, information fusion, part of speech attention mechanism, joint learning