
Research And Implementation Of Semantic Marking System Based On Entity Linking

Posted on: 2019-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: T Liu
GTID: 2348330542498171
Subject: Computer Science and Technology
Abstract/Summary:
In the era of big data, there are more and more ways to obtain data, such as traditional search engines and social media websites. As the volume of data grows, so does the amount of irrelevant information, and people find it hard to extract useful information from it. Entity linking addresses this problem: its task is to extract the important entities in a text and link them to a knowledge base, helping people find the important information in the data.

This thesis studies and implements a semantic annotation system based on entity linking. The target texts are scientific literature data, including papers, funds, and patents. The local knowledge base contains person names, organization names, and technical terms from scientific literature, and it is combined with a wiki knowledge base to form a multi-source knowledge base on which named entity linking is performed. For a query, the system searches the local and wiki knowledge bases to obtain a set of candidate entities, then uses a character-based CNN text classification algorithm together with a popularity algorithm to determine the target entity. The same algorithm can also select key entities that are strongly related to the current text.

Unlike traditional candidate entity disambiguation algorithms, such as classification and ranking algorithms, this thesis makes use of the background of the text: a CNN classifier assigns each candidate entity to a category in the scientific literature classification system, and the candidate whose category matches that of the current text is selected. When several candidate entities share the category of the query entity, the popularity algorithm selects the most common one. The proposed algorithm therefore not only disambiguates candidate entities but also selects key named entities: by setting a threshold, a mention is annotated only if the distance between the candidate entity and the category of the current text is within a certain range. Experimental results show that this disambiguation method performs well for text annotation in the field of scientific literature.

This thesis studies the implementation details of open-source entity linking frameworks such as Dexter and DBpedia, along with the key processing flow of entity linking. It explores how to build and store a multi-source knowledge base, for example by building the wiki knowledge base from anchor text and using the in-memory database PyDbLite to store entity sets and candidate entities. It also compares the advantages and disadvantages of different named entity recognition tools and selects the AC algorithm for named entity recognition. Experiments show that the AC algorithm is accurate, fast, and light on resources.

To deliver a usable system, this thesis investigates related technologies and frameworks, such as Python and Django, as well as how to implement a Chrome plugin. Based on this research, it implements entity linking as a REST service with an API that returns annotated data for the input text, and it supports uploading files and annotating their contents. To make the entity linking function easier to use, it also implements a Chrome annotation plugin that can annotate any text on a web page as the user requires, meeting users' needs in different scenarios.
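The disambiguation rule described in the abstract (keep candidates whose category matches the text, require the category distance to fall within a threshold, then break ties by popularity) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the `Candidate` fields, the `link_mention` function, the total variation distance, the default threshold, and the example entries are all assumptions made for illustration. The category probabilities stand in for the output of a text classifier such as the character-based CNN.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A candidate knowledge-base entry for one mention (hypothetical structure)."""
    name: str
    category_probs: Sequence[float]  # assumed classifier output over categories
    popularity: float                # assumed score, e.g. anchor-text link count


def argmax(probs: Sequence[float]) -> int:
    """Index of the most probable category."""
    return max(range(len(probs)), key=lambda i: probs[i])


def category_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions (an assumed metric)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def link_mention(
    text_probs: Sequence[float],
    candidates: Sequence[Candidate],
    threshold: float = 0.5,  # assumed value; the abstract gives no number
) -> Optional[Candidate]:
    """Return the linked candidate, or None if the mention should stay unmarked."""
    text_category = argmax(text_probs)
    # Step 1: keep candidates that fall in the same category as the text.
    matching = [c for c in candidates if argmax(c.category_probs) == text_category]
    # Step 2: keep only candidates close enough to the text's category profile.
    close = [
        c for c in matching
        if category_distance(c.category_probs, text_probs) <= threshold
    ]
    if not close:
        return None
    # Step 3: among the remaining candidates, pick the most popular one.
    return max(close, key=lambda c: c.popularity)


if __name__ == "__main__":
    # Illustrative data only: categories are (computing, media, other).
    text = [0.8, 0.1, 0.1]
    cands = [
        Candidate("CNN (neural network)", [0.9, 0.05, 0.05], 120.0),
        Candidate("CNN (news channel)", [0.1, 0.8, 0.1], 300.0),
        Candidate("CNN (software package)", [0.7, 0.2, 0.1], 40.0),
    ]
    print(link_mention(text, cands))
```

Returning `None` when no candidate passes the threshold is what lets the same rule double as key-entity selection: mentions whose candidates are all far from the text's category are simply left unannotated.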
Keywords/Search Tags: natural language processing, entities link words, CNN, multi-source knowledge base