Research On Chinese Entity Linking Based On Online Encyclopedia

Posted on: 2018-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: J W Yuan
GTID: 2348330521450675
Subject: Computer technology
Abstract/Summary:
As the Internet has become ubiquitous, it has turned into a massive store of information, and more and more people turn to it for knowledge. Two problems have therefore become research hotspots in natural language processing: how to let users quickly obtain information about the words in a text that interest them, and how to use an existing knowledge base to annotate the large number of new words appearing in web texts. Entity linking systems recognize entity mentions in text and link them to the corresponding entities in a knowledge base. Through entity linking, users can learn about the entities in a text more quickly and conveniently, which helps them understand the text faster. Entity linking has also promoted the development of semantic web construction, knowledge base construction, information retrieval, and question answering systems.

Over the past few years, a large number of online platforms have emerged that people use to communicate with each other, and a great deal of text data is generated through them. Because users differ in skill level and manner of expression, short texts contain many typos, abbreviations, aliases, internet slang, catchphrases, nicknames, and ambiguous or informal expressions, so traditional entity recognition and linking systems are ineffective on them. Long texts contain a large number of co-references, and how to link these co-referring entity mentions correctly to the knowledge base is a problem that remains to be solved.

The main work of the thesis is as follows:

(1) Web crawler technology is studied. A method based on templates and regular expressions is used to automatically extract proxy IPs from the web. The method meets the need to acquire proxy IPs in batches and improves the stability of the web crawler system. Hudong encyclopedia, Baidu encyclopedia, and Douban are studied, and a way to access all of their URLs is determined. On this basis, a network data acquisition system is designed and implemented. The collected data are then extracted to build a Chinese knowledge base and a synonym thesaurus. In addition, person names are extracted from the Chinese knowledge base to expand the synonym thesaurus based on Baidu Zhixin.

(2) To address the large number of abbreviations in short texts, the thesis proposes an entity recognition algorithm that combines SWJTU Chinese word segmentation with a dictionary and an online encyclopedia. For entity linking, a method is adopted that combines the synonym thesaurus, Wikipedia redirects, an improved PED (Pinyin Edit Distance), and LCS (Longest Common Subsequence). For entity disambiguation, two algorithms are proposed, one based on suffix completion and the other based on online encyclopedia link weight.

(3) For entity linking in long texts, features of Wikipedia are extracted to construct a name-entity table. For candidate entity generation, name-entity table matching, pinyin matching, popularity matching, and context-based candidate generation and filtering are adopted. For candidate entity disambiguation, a method is used that combines literal similarity, entity context similarity, and entity similarity.

To verify the effectiveness of the proposed entity linking methods, the data of the NLPCC2015 entity recognition and linking shared task are used as the short-text experimental data set, and manually labelled Wikipedia texts are used as the long-text experimental data set. The results show that the proposed algorithms are effective on both short and long texts.
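To make the string-matching step in item (2) more concrete, the following is a minimal sketch of scoring a mention against a knowledge-base name with LCS and an edit distance over pinyin. It is not the thesis's algorithm: the abstract does not describe how its PED is "improved", so plain Levenshtein distance is used instead, and the function names, the equal 0.5 weights, and the hand-supplied pinyin strings are all assumptions made for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(mention: str, mention_pinyin: str,
               name: str, name_pinyin: str) -> float:
    """Hypothetical score in [0, 1]: average of an LCS ratio on the
    characters and a normalized edit-distance similarity on the pinyin.
    The 0.5/0.5 weighting is an assumption, not taken from the thesis."""
    lcs_score = lcs_length(mention, name) / max(len(mention), len(name), 1)
    longest = max(len(mention_pinyin), len(name_pinyin), 1)
    ped_score = 1 - edit_distance(mention_pinyin, name_pinyin) / longest
    return 0.5 * lcs_score + 0.5 * ped_score


# Usage: score an abbreviation against two candidate names.
# Pinyin is typed by hand here to keep the sketch dependency-free.
candidates = [("北京大学", "beijingdaxue"), ("清华大学", "qinghuadaxue")]
ranked = sorted(
    candidates,
    key=lambda c: similarity("北大", "beida", c[0], c[1]),
    reverse=True,
)
```

LCS tolerates dropped characters, which suits abbreviations, while comparing pinyin rather than characters can catch homophone typos; a real system would presumably obtain pinyin from a romanization library and tune the weights on labelled data.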
Keywords/Search Tags: entity recognition, entity linking, entity disambiguation, suffix supplement, link weight, online encyclopedia