
Research And Implementation Of Online Entity Disambiguation Based On Entity Gene

Posted on: 2019-10-23
Degree: Master
Type: Thesis
Country: China
Candidate: L Duan
GTID: 2428330623950679
Subject: Computer Science and Technology
Abstract/Summary:
The Internet has become the center of our lives since the Web appeared. We transmit information on the Internet in various forms such as text, pictures, audio and video, accumulating an enormous amount of data, the vast majority of which is text. These texts contain a great deal of information about people, organizations and other entities, and they roughly reflect our real social life. Extracting the valuable information, however, is a large and difficult task. One serious problem is entity ambiguity. It is usually caused by polysemy in natural language: the same mention of an entity appears in many texts, and it is not clear which entity each occurrence refers to. For example, given the person name "John", the computer cannot tell which person the word refers to. Entity disambiguation is the technology that resolves this ambiguity. It can be applied to many tasks, such as translation systems, automatic question answering systems, reading aid systems, semantic search systems and knowledge base population, and it plays a very important role in natural language processing.

Entity disambiguation methods fall broadly into two categories according to whether they depend on a predefined knowledge base, and most of them do. But a knowledge base is often incomplete. If the target we need to mine is not in the predefined knowledge base, those methods do not work. We propose an entity disambiguation method that does not depend on a predefined knowledge base. Our method achieves disambiguation by clustering entities according to the matching degree of their entity genes, and it focuses mainly on Person and Organization entities in Internet corpora. The entity gene is a representation of entity information that we propose; it consists of an entity word gene and an entity property gene. The word gene characterizes an entity by the entity words related to the target entity. The property gene describes an entity by its properties, such as "birthday", "couple" and so on. The similarity of two genes comprises the similarity of their word genes and the similarity of their property genes. The former is based mainly on the TF-IDF values of the words related to the entity, and the latter is the sum of the weights of the properties that the two genes have in common. Finally, the two matching models are linearly combined into one, which gives the similarity of the two genes. If the similarity reaches a certain threshold, the two genes are considered to belong to the same entity. Our method is an unsupervised algorithm that requires no deep semantic analysis or computation, so it can serve as an online entity clustering disambiguation algorithm. The method proposed in this thesis can be applied to target analysis and knowledge base population in massive text collections.
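The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `EntityGene` class, the use of cosine similarity over TF-IDF weights, the rule that a "common" property is one with equal values in both genes, the mixing coefficient `alpha`, the `threshold` value and the single-representative clustering in `assign_cluster` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EntityGene:
    """Hypothetical container for one entity mention's gene."""
    words: Dict[str, float] = field(default_factory=dict)     # word -> TF-IDF weight
    properties: Dict[str, str] = field(default_factory=dict)  # property name -> value


def word_similarity(a: EntityGene, b: EntityGene) -> float:
    """Cosine similarity of the TF-IDF word vectors (assumed measure)."""
    dot = sum(w * b.words.get(term, 0.0) for term, w in a.words.items())
    norm_a = math.sqrt(sum(w * w for w in a.words.values()))
    norm_b = math.sqrt(sum(w * w for w in b.words.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def property_similarity(a: EntityGene, b: EntityGene,
                        weights: Dict[str, float]) -> float:
    """Sum of the weights of properties whose values agree in both genes.

    Treating "common" as "same name and same value" is an assumption.
    """
    return sum(
        weights.get(name, 0.0)
        for name, value in a.properties.items()
        if b.properties.get(name) == value
    )


def gene_similarity(a: EntityGene, b: EntityGene,
                    weights: Dict[str, float], alpha: float = 0.5) -> float:
    """Linear combination of the two matching models (alpha is assumed)."""
    return (alpha * word_similarity(a, b)
            + (1.0 - alpha) * property_similarity(a, b, weights))


def assign_cluster(gene: EntityGene, clusters: List[EntityGene],
                   weights: Dict[str, float],
                   threshold: float = 0.6, alpha: float = 0.5) -> int:
    """Online step: join the best-matching cluster or open a new one.

    Each cluster is represented by its first gene, a simplification.
    """
    best_index, best_score = -1, 0.0
    for index, representative in enumerate(clusters):
        score = gene_similarity(gene, representative, weights, alpha)
        if score > best_score:
            best_index, best_score = index, score
    if best_index >= 0 and best_score >= threshold:
        return best_index
    clusters.append(gene)
    return len(clusters) - 1
```

Because each incoming gene is compared only against existing cluster representatives and no training is involved, a loop over `assign_cluster` processes mentions one at a time, which is the sense in which such a scheme can run online. Property weights that sum to at most 1 keep the combined score on the same scale as the threshold.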
Keywords/Search Tags: entity disambiguation, entity clustering, gene similarity, knowledge base population, TF-IDF