Named Entity Disambiguation Based On Chinese And English Wikipedia Knowledge Base

Posted on: 2016-08-19 | Degree: Master | Type: Thesis
Country: China | Candidate: N C Zuo | Full Text: PDF
GTID: 2298330467492032 | Subject: Electronics and Communications Engineering
Abstract/Summary:
Word disambiguation is an important task in natural language processing. Recently, Named Entity Linking (NEL) has been widely used to address it. NEL grounds entity mentions to their corresponding nodes in a Knowledge Base (KB). We studied several popular named entity disambiguation strategies and identified the differences between them. This paper proposes a method for named entity disambiguation based on Wikipedia data; theory and experiments indicate that the method can be applied to both Chinese and English. The main research contents and results are as follows:

1. We build a Chinese knowledge base and a source document collection. Following the structure of the TAC KBP English knowledge base, we build a Chinese knowledge base from Chinese Wikipedia that contains over 3,740,000 entries. The source document collection contains 328 documents with 17 ambiguous mentions, which may refer to 61 entries in the KB.

2. We propose an approach for analyzing and extracting knowledge from Wikipedia and build 8 separate datasets for entity disambiguation, including datasets of canonical entry names, redirect information, disambiguation information, linked entities, popularity information, each entity's context and the other entities near it, and categories. Besides named entity disambiguation, these datasets can support machine translation, information retrieval, web search, and intelligent systems.

3. We propose a method for analyzing candidate nodes. For each candidate node we compute a five-feature vector using PageRank, the vector space model (VSM), and other statistical learning techniques. The vector reflects the similarity between the node and the mention as well as the node's own popularity. The five features are string similarity, popularity, context similarity, relatedness between linked entities and queries, and category relatedness.

4. We propose a classification method that applies a Decision Tree to the five-feature vectors, and we build an Entity Linking system to verify it. We carried out the Entity Linking task on the TAC 2012, TAC 2013, and Chinese data. The method proved effective, raising F1 by 12 percent, and it can be widely used.
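As a rough illustration of the candidate-scoring idea described above, the following sketch computes three of the five features (string similarity, popularity, and context similarity) for a candidate KB entry. It is not the thesis's implementation: the function names, the dictionary layout of a candidate, the use of `difflib` for string similarity, raw term counts for the VSM, and inbound-link share as the popularity measure are all assumptions made for this example. The two relatedness features and the Decision Tree step are omitted.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def string_similarity(mention: str, title: str) -> float:
    """Character-level similarity in [0, 1] between a mention and an entry title."""
    return SequenceMatcher(None, mention.lower(), title.lower()).ratio()


def context_similarity(mention_context: str, entry_text: str) -> float:
    """Cosine similarity of raw term-count vectors (a very simple VSM)."""
    a = Counter(mention_context.lower().split())
    b = Counter(entry_text.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def popularity(inlinks: int, total_inlinks: int) -> float:
    """Assumed popularity proxy: this candidate's share of inbound links."""
    return inlinks / total_inlinks if total_inlinks else 0.0


def feature_vector(mention: str, context: str, candidate: dict, total_inlinks: int) -> list:
    """Partial feature vector: [string similarity, popularity, context similarity]."""
    return [
        string_similarity(mention, candidate["title"]),
        popularity(candidate["inlinks"], total_inlinks),
        context_similarity(context, candidate["text"]),
    ]


# Hypothetical candidates for the mention "Apple"; the data is made up.
candidates = [
    {"title": "Apple Inc.", "inlinks": 900,
     "text": "technology company that makes the iphone and mac computers"},
    {"title": "Apple", "inlinks": 100,
     "text": "fruit of the apple tree grown in orchards"},
]
context = "the company released a new iphone"
total = sum(c["inlinks"] for c in candidates)
vectors = [feature_vector("Apple", context, c, total) for c in candidates]
```

In a full pipeline, vectors like these (extended with the two relatedness features) would be the input to a classifier such as a decision tree, which decides whether a candidate is the correct referent for the mention.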
Keywords/Search Tags: Named Entity, Disambiguation, Knowledge Base, Wikipedia