Multi-Feature Based Semi-supervised Learning For Large Scale Entity Classification

Posted on: 2014-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: X R Sun
Full Text: PDF
GTID: 2248330392960925
Subject: Computer Science and Technology
Abstract/Summary:
As the Internet grows, the amount of information on it has grown significantly. It is an interesting and challenging problem to construct a machine-readable knowledge base that includes every entity and the semantic relationships between them. The emergence of collaborative online encyclopedias like Wikipedia has brought many opportunities for algorithms that try to understand the semantic information on the Internet. However, these algorithms tend to produce noisy results because the information in Wikipedia is mostly unstructured data, mainly natural language. If an entity knowledge base can be built with the wide coverage of Wikipedia and highly accurate semantic information, it will further improve the capacity of these algorithms and also make possible applications like semantic search. On the other hand, as Semantic Web technology evolves and becomes more practical, more and more organizations are starting to manage corporate data using semantic technologies. These organizations often start by modeling ontologies for their specific business use cases. After that, it costs a lot to fill the ontologies with instances or to migrate from legacy databases. We look for a way to utilize the data in existing structured databases, or even linked open data, to fill the ontology with high-quality entities.

The goal of this thesis is to study the means to construct such a large-scale entity knowledge base, focusing on categorization. The construction should rely on as little human labeling effort as possible and at the same time ensure the quality of categorization. This problem poses the following three challenges. How to collect multi-faceted features from multiple data sources for entity classification? How to acquire training data for training models based on the input ontology in a semi-automated way? How to effectively evaluate the results of large-scale entity classification?

This thesis introduces a practical semi-automated entity classification framework to tackle these challenges. The framework includes a preprocessing stage and three main stages. In the preprocessing stage, the entities in different data sources are matched and their features are integrated. In the first stage, seed entities of each category are discovered in a semi-automated way. These seed entities are used as training data in the second stage, semi-supervised learning, to further expand the set of entities. In the third stage, effective parameter selection and evaluation are carried out along with the output of the entity categorization.

The experiments show very positive results. In the data set of Chinese encyclopedias, there are quite a lot of duplicated entities. After merging, the total number of entities exceeds that of each individual data source. The features of the matched entities from different data sources can complement each other, and these multi-faceted features are quite helpful in entity classification. The method we propose to select and optimize rule templates can be used to discover seed entities in a semi-automated way. This method turned out to have very high label efficiency, and it achieves almost the same results as labeling individual entities. In the experiments, the proposed ExCore algorithm can generate enough negative training instances automatically, which also achieves almost the same results as manual labeling. To conclude, we think the proposed framework effectively solves the problem of large-scale entity classification utilizing multiple features and very little labeling effort.
Keywords/Search Tags: ontology population, entity classification, semi-supervised learning