A Semi-supervised Based Method For Entity Set Expansion

Posted on: 2016-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Ma
Full Text: PDF
GTID: 2298330467479681
Subject: Computer technology
Abstract/Summary:
With the growth of the Internet, document resources have become increasingly abundant, and one of the most important tasks in data mining is extracting useful information from them. One line of work aims to generate structured information from Internet documents so that users can access it easily. Following this idea, search engine companies (such as Google and Baidu) have proposed the concept of the knowledge graph, in which entities are organized by their categories and relationships, to make data more convenient for users to access.

The main topic of this thesis is how to expand entity sets from documents crawled from the Internet. Entity set expansion is the task of obtaining more entities of the same category from a small number of seed entities. It is a basic task for semantic research, question answering systems, and knowledge bases. Traditional methods for entity set expansion mainly consider the co-occurrence relationships between entities and expand the set iteratively, which causes semantic drift and leads to poor precision. This thesis proposes an algorithm to address these problems: candidate entities are obtained with wrappers, semantic information is obtained with a topic model, and entity sets are expanded with the label propagation algorithm.

The work is divided into two parts: candidate term extraction and the entity set expansion algorithm. The first part uses automatically built wrappers to extract candidates that appear in contexts similar to those of the seeds. The second part groups candidates into lists according to document structure and determines whether each candidate word should be added to the entity set. The main research work is as follows:

1. Traditional approaches that mine candidates from manually specified words or templates incur a high human cost, and obtaining candidate entities directly from word segmentation tools cannot effectively find new words. Both approaches suffer from low recall. To address this, the thesis proposes a method that automatically learns wrappers from the context of the seed words and uses them to extract candidate entities, guaranteeing a certain level of recall.

2. Many candidate entities occur only rarely and differ considerably from the seed words, which hurts the accuracy of the final set expansion step. The thesis therefore constructs a graph model with three types of nodes (seed words, wrappers, and candidate words) and applies a random walk algorithm on it to estimate the confidence of each candidate word, providing a preliminary screening of the candidates.

3. A single word taken as an entity can be ambiguous. During entity set expansion, the candidate words that appear in the same paragraph are grouped into a word list, under the assumption that all words in a list describe a consistent topic. Each word list is treated as a whole during expansion, which avoids the ambiguity that individual words may carry.

4. Traditional entity set expansion methods do not consider the semantic information of the words being added, so entities that do not belong to the category are drawn in during expansion. The thesis uses the LDA model to mine the topics corresponding to the context of each entity word list, enriching the semantic information available during expansion and addressing the semantic drift of traditional methods.

5. To take into account both the co-occurrence relationships and the semantic relationships between seed words and candidate words, the thesis builds a heterogeneous graph with four types of nodes: seed words, candidate words, word lists, and the topics corresponding to the word-list contexts. Because seed words are few, the semi-supervised label propagation algorithm is used, and the entity set is expanded with the word list as the unit.
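
The abstract does not give the update rule or the edge weights used for label propagation, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes a damped propagation rule in the style of label spreading (each node's score is a mix of its neighbours' weighted average and its initial label), and the function name `propagate_labels`, the parameter `alpha`, the node names, and the toy graph are all hypothetical.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, alpha=0.8, iterations=50):
    """Spread seed labels over an undirected weighted graph.

    edges: iterable of (node_a, node_b, weight) with positive weights
    seeds: set of nodes known to belong to the target category
    Returns a dict mapping every node in the graph to a score in [0, 1].
    Assumed update rule (not taken from the thesis):
        score = alpha * weighted_mean(neighbour scores) + (1 - alpha) * initial
    """
    neighbors = defaultdict(dict)
    for a, b, w in edges:
        neighbors[a][b] = w
        neighbors[b][a] = w

    # Seeds start at 1.0, every other node at 0.0.
    initial = {n: (1.0 if n in seeds else 0.0) for n in neighbors}
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            total = sum(nbrs.values())
            avg = sum(w * scores[m] for m, w in nbrs.items()) / total
            updated[node] = alpha * avg + (1 - alpha) * initial[node]
        scores = updated
    return scores


# Hypothetical toy graph mixing the four node types named in the abstract:
# seed words, candidate words, word lists, and topics.
edges = [
    ("Beijing", "list:cities", 1.0),      # seed word in a word list
    ("Shanghai", "list:cities", 1.0),     # seed word in a word list
    ("Guangzhou", "list:cities", 1.0),    # candidate in the same list
    ("list:cities", "topic:geography", 1.0),
    ("apple", "list:fruits", 1.0),        # candidate in an unrelated list
    ("banana", "list:fruits", 1.0),
    ("list:fruits", "topic:food", 1.0),
    ("Guangzhou", "list:mixed", 1.0),     # a noisy list linking both groups
    ("apple", "list:mixed", 1.0),
]
seeds = {"Beijing", "Shanghai"}
scores = propagate_labels(edges, seeds)

# Rank the word lists, since the abstract expands the set list by list.
ranked_lists = sorted(
    (n for n in scores if n.startswith("list:")),
    key=lambda n: scores[n],
    reverse=True,
)
```

In this sketch the list that shares members with the seeds ends up with the highest score, and a threshold on the list scores would decide which lists are added to the entity set. The damping term keeps scores from all converging to the seed value, which plain neighbour averaging would do when only positive seeds are available.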
Keywords/Search Tags: entity set expansion, wrapper, semantic drift, topic model, label propagation