
Research On Multi Data Source Entity Matching In The Construction Of Knowledge Map

Posted on: 2019-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Zou
Full Text: PDF
GTID: 2428330548988928
Subject: Software engineering
Abstract/Summary:
With the rapid development of e-commerce and mobile Internet technologies, data on the Internet has begun to grow explosively. At this scale, information is usually loose and fragmented, and it contains redundant and even incorrect content. How to extract useful information from vast amounts of data and share it among different agencies and organizations has become key to big data analysis and artificial intelligence. As an effective means of extracting structured knowledge from massive data, knowledge graph technology has received extensive attention from the academic community. A single data source describes information incompletely and has low coverage, which makes it hard to meet a knowledge graph's need for a large volume of knowledge and rich content. How to efficiently combine and integrate large-scale, loose, multi-source, multi-type, and inconsistent data has therefore become a key issue in constructing large-scale knowledge graphs. This is the main motivation for the multi-source entity matching research in this thesis. The main research work covers the following three aspects:

(1) To address the shortcomings of existing Chinese word similarity calculation, a word similarity calculation based on a fusion strategy is proposed, built on improvements to two existing methods. First, using the Chinese dictionary HowNet as a corpus, a similarity calculation based on commonality and individuality is proposed by analyzing the distribution of semantic information in Chinese words and the information structure of Chinese words in HowNet. Second, a word similarity calculation based on a search engine is improved. Finally, several fusion strategies are used to fuse the two improved similarity calculation methods. The experimental results show that the word similarity calculation based on the linear fusion strategy performs well on the data set of the NLPCC 2016 Chinese word similarity evaluation task: the Pearson coefficient reaches 0.458 and the Spearman coefficient reaches 0.461.

(2) A multi-level entity matching algorithm based on top-K is proposed. This method extends the idea of candidate keys in relational databases to entity matching. It assumes that an entity has a largest attribute set that contains multiple groups of entity candidate keys, and it obtains entity matching results from the similarity of the attribute candidate keys. The method makes full use of how attributes are distributed among entities and reduces the complexity of similarity calculation in the entity matching process. For attribute values of string type, it considers the semantic information of the values for the first time, replacing the traditional edit-distance-based string similarity with a word similarity calculation. The experimental results show that the algorithm is stable on two real data sets, one of movies and one of books, and that the average accuracy of entity matching is above 0.97.

(3) Taking Baidu Encyclopedia, 360 Encyclopedia, Douban Movie, and Mtime as data sources, and using the entity matching algorithm proposed in this thesis, the video information in these four data sources was integrated, and a Chinese film and television knowledge graph was designed and constructed. During multi-source entity matching for the knowledge graph, Baidu Encyclopedia is used as the reference knowledge base, and the data in 360 Encyclopedia, Douban Movie, and Mtime are identified and matched against it in that order to form a single film and television knowledge base. The resulting knowledge base contains 208,705 movie and TV tuples.
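The linear fusion in (1) and the candidate-key matching in (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no formulas, so the weight `alpha`, the threshold, the function names, and the sample records are all hypothetical. Reading "top-K" as "keep the K highest-scoring reference entities" is also an assumption, and `difflib.SequenceMatcher` is only a stand-in for the semantic word similarity the thesis actually uses.

```python
from difflib import SequenceMatcher


def fuse_linear(sim_a, sim_b, alpha=0.5):
    """Linearly fuse two similarity scores in [0, 1] (alpha is a hypothetical weight)."""
    return alpha * sim_a + (1.0 - alpha) * sim_b


def value_similarity(a, b):
    """Stand-in string similarity; the thesis uses a semantic word similarity here."""
    return SequenceMatcher(None, a, b).ratio()


def key_similarity(e1, e2, key):
    """Mean value similarity over one candidate key, or None if an attribute is missing."""
    if any(attr not in e1 or attr not in e2 for attr in key):
        return None
    return sum(value_similarity(e1[attr], e2[attr]) for attr in key) / len(key)


def match_entity(entity, reference, candidate_keys, threshold=0.9, k=3):
    """Try candidate keys level by level and return up to k (score, index) matches.

    The first candidate key that yields any reference entity scoring at or
    above the threshold decides the result; later keys are fallbacks.
    """
    for key in candidate_keys:
        scored = []
        for idx, ref in enumerate(reference):
            score = key_similarity(entity, ref, key)
            if score is not None and score >= threshold:
                scored.append((score, idx))
        if scored:
            # Highest score first; ties broken by position in the reference list.
            scored.sort(key=lambda pair: (-pair[0], pair[1]))
            return scored[:k]
    return []


# Hypothetical records: the reference plays the role of the reference knowledge base.
reference = [
    {"title": "Farewell My Concubine", "year": "1993", "director": "Chen Kaige"},
    {"title": "Hero", "year": "2002", "director": "Zhang Yimou"},
]
entity = {"title": "Farewell My Concubine", "year": "1993"}
candidate_keys = [("title", "director"), ("title", "year")]
matches = match_entity(entity, reference, candidate_keys)
```

In this sketch the incoming record has no `director` attribute, so the first candidate key cannot be evaluated and matching falls through to `("title", "year")`. That fallback is the kind of multi-level behaviour the abstract describes, under the assumptions stated above.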
Keywords/Search Tags:Entity Matching, Knowledge Graph, Word Similarity, Candidate key, top-K