
Research On Key Technologies Of Constructing Person Entity Relation Graph For Public Information In The Web

Posted on: 2020-02-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y C Huang
Full Text: PDF
GTID: 1488306548992639
Subject: Software engineering
Abstract/Summary:
As human society has entered the era of big data, data resources in cyberspace have become increasingly abundant. The volume of data has far exceeded the processing capacity of traditional data analysis technology and information systems, and effective information processing techniques for massive data have become an urgent demand in many fields. The knowledge graph (KG), whose essence is a semantic network, connects the entities, relations and attributes of the objective world into a huge network of structured knowledge. As an important carrier of human knowledge, a KG provides a convenient and fast means of information acquisition in the context of big data. Person entities, as hubs of information interaction, often play a key role in locating target knowledge, so constructing a KG of relations between person entities is of great significance for information retrieval, intelligence analysis, commercial marketing and many other fields. However, the massive and irregular nature of network information makes information acquisition difficult, and the openness and diversity of the Internet pose many challenges to integrating and cleaning knowledge from different data sources.

Based on extensive analysis of related research at home and abroad, this dissertation studies the construction of the target KG along two dimensions: information extraction and knowledge fusion. For large-scale unstructured network text, relation extraction and attribute extraction are adopted to obtain structured knowledge; to address the redundancy and ambiguity of multi-source Internet knowledge, entity linking and person name disambiguation are used for knowledge fusion. The main contributions are as follows:

(1) In relation extraction, to address the difficulty and high cost of labeling data in a big data environment, a person entity relation extraction model for Chinese news corpora is proposed that uses distant supervision to generate training data automatically. First, weakly labeled data are generated by aligning a knowledge base with the corpus text. Second, to handle the noise in these weak labels, a relation indicator filtering algorithm based on TF-IDF is proposed for denoising. Finally, lexical and syntactic features are extracted from the natural language processing results of words and sentences respectively, and the training text is mapped to multi-factor relation feature vectors used to train the relation classifier. Experiments on a large-scale real news corpus show that the proposed model outperforms other comparable methods and scales well; not requiring a manually labeled corpus gives it great practical value.

(2) In attribute extraction, previous studies have shown that the accuracy of attribute feature representation in text directly affects extraction results. Addressing this key problem, a person entity attribute extraction model based on a Siamese network is proposed, with the aim of learning more discriminative attribute representations. The model consists of two subnetworks. First, using the Siamese network's dual-input structure, the attribute encoder learns more accurate attribute vectors by constraining the similarity between the target sentence and a parallel sentence. Then the attribute predictor uses these vectors to train an attribute classifier that extracts attribute information. Experiments on English Wikipedia data show that, compared with traditional sequential-input models, paired input lets the model compare features directly and thus summarize the features that distinguish attributes more accurately; the model reaches the state of the art in attribute extraction.

(3) In entity linking, existing methods usually map the context of a mention and the entity information into the same semantic space and then select the real entity corresponding to the mention by distance measurement. However, both semantic representation with hand-designed features and semantic embedding with word vector models require large amounts of human effort and computing resources. This dissertation therefore proposes a deep entity linking model based on BERT to eliminate semantic ambiguity. First, BERT is fine-tuned to obtain vector representations of mentions and entities in the same semantic space; in addition, a hard negative sample mining strategy designed from real data pushes the model to learn deeper semantic information rather than merely string similarity. Second, candidate entity lists for mentions are generated from existing public information. Finally, an entity disambiguation network based on a multi-layer perceptron selects the corresponding entity from the candidate list. Experiments on the well-known entity linking benchmarks CoNLL 2003 and TAC 2010 show that the model outperforms other representative algorithms and achieves state-of-the-art entity linking results.

(4) In person name disambiguation, most existing methods represent person name mentions with hand-designed features, and their clustering algorithms often require the number of clusters to be predefined from training data. To overcome both shortcomings, this dissertation proposes a deep person name disambiguation model based on non-negative matrix factorization. First, the pre-trained language model BERT is optimized with a triplet loss to obtain semantic representation vectors of person name mentions. Then a clustering algorithm based on non-negative matrix factorization groups the learned representations to achieve name disambiguation. Because the method does not need the number of clusters in advance, it has greater practical value than existing methods. Results on the standard competition datasets WePS-1 and WePS-2 validate the model, which clearly outperforms other related models.
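The abstract does not give the exact formulation of the TF-IDF-based relation indicator filtering in contribution (1). A minimal sketch of the idea, under the assumption that indicator words are ranked by TF-IDF per relation type and weakly labeled sentences lacking any high-ranked indicator are discarded as noise (function names and the top-k cutoff are illustrative, not the dissertation's):

```python
from collections import Counter
import math

def tfidf_indicators(sentences_by_relation, top_k=5):
    """Rank candidate indicator words per relation by TF-IDF.

    sentences_by_relation: dict mapping relation label -> list of token lists.
    Treats each relation's pooled sentences as one 'document', so words shared
    across many relations get a low IDF and are down-weighted.
    """
    df = Counter()          # number of relations in which a word occurs
    tf = {}                 # per-relation term counts
    for rel, sents in sentences_by_relation.items():
        counts = Counter(w for s in sents for w in s)
        tf[rel] = counts
        for w in counts:
            df[w] += 1
    n_rel = len(sentences_by_relation)
    indicators = {}
    for rel, counts in tf.items():
        total = sum(counts.values())
        scored = {w: (c / total) * math.log(n_rel / df[w]) for w, c in counts.items()}
        indicators[rel] = [w for w, _ in
                           sorted(scored.items(), key=lambda kv: -kv[1])[:top_k]]
    return indicators

def filter_noisy(sentences, relation, indicators):
    """Keep only weakly labeled sentences that contain an indicator word
    for their assigned relation; the rest are treated as distant-supervision noise."""
    keep = set(indicators[relation])
    return [s for s in sentences if keep & set(s)]
```

On toy data, `filter_noisy` drops an aligned-but-unrelated sentence such as `["A", "likes", "B"]` labeled "spouse" because it contains no spouse indicator.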
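The Siamese constraint in contribution (2) — a shared encoder applied to a target sentence and a parallel sentence, with similarity pushed up for same-attribute pairs and down otherwise — can be sketched with a toy averaging encoder and a contrastive objective. This is an illustration of the dual-input training signal only; the dissertation's actual encoder, loss and margin are not specified in the abstract:

```python
import math

def encode(tokens, weights):
    """Shared encoder (same weights for both inputs): average of token vectors."""
    dim = len(next(iter(weights.values())))
    vec = [0.0] * dim
    for t in tokens:
        for i, v in enumerate(weights.get(t, [0.0] * dim)):
            vec[i] += v
    return [v / max(len(tokens), 1) for v in vec]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def contrastive_loss(s1, s2, same_attribute, weights, margin=0.5):
    """Siamese objective: pull same-attribute sentence pairs together,
    push different-attribute pairs at least `margin` apart in cosine distance."""
    d = 1.0 - cosine(encode(s1, weights), encode(s2, weights))
    return d * d if same_attribute else max(0.0, margin - d) ** 2
```

The loss is zero for an identical same-attribute pair and positive for a different-attribute pair that the encoder still maps close together, which is exactly the signal that sharpens the attribute representation.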
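The hard negative mining strategy in contribution (3) is described only as forcing the model past string similarity. One common way to realize that, sketched here as an assumption rather than the dissertation's method, is to pick as negatives the wrong candidates whose surface forms are most similar to the mention:

```python
import difflib

def mine_hard_negatives(mention, gold_id, candidates, k=2):
    """Return the k wrong candidates whose names are most string-similar to the
    mention, so training cannot succeed on surface overlap alone.

    candidates: list of (entity_id, entity_name) pairs.
    """
    negatives = [(cid, name) for cid, name in candidates if cid != gold_id]
    negatives.sort(
        key=lambda cn: difflib.SequenceMatcher(None, mention, cn[1]).ratio(),
        reverse=True,
    )
    return [cid for cid, _ in negatives[:k]]
```

For the mention "Michael Jordan", such a miner prefers "Michael I. Jordan" over "Jordan (country)" as the hard negative, since the former is the string-similar distractor the model must learn to reject from context.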
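Contribution (4) fine-tunes BERT with a triplet loss before clustering. The standard triplet objective over mention embeddings — anchor and positive refer to the same person, negative to a different one — can be written as follows (margin value and distance choice are illustrative; the abstract does not specify them):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a,p) - d(a,n) + margin) with squared Euclidean distance:
    pulls mentions of the same person together and pushes mentions of
    different people at least `margin` further away."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, sqdist(anchor, positive) - sqdist(anchor, negative) + margin)
```

Once embeddings are trained this way, the non-negative matrix factorization step can cluster them without a predefined cluster count, since the effective number of clusters falls out of the factorization rank rather than being fixed beforehand.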
Keywords/Search Tags:Knowledge Graph, Information Extraction, Deep Learning, Natural Language Processing, Machine Learning, Knowledge Fusion