Research On Organization Name Disambiguation On Twitter Data

Posted on: 2013-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: J W Wu
Full Text: PDF
GTID: 2268330392467951
Subject: Computer Science and Technology
Abstract/Summary:
Entity ambiguity arises when the name of an entity can refer to multiple concepts. Precise and detailed entity disambiguation is indispensable for automatically analysing text or constructing a large-scale knowledge base. With the explosive growth of the World Wide Web and social networks, how to automatically analyse and organize entity-related information is attracting more and more researchers' attention.

Organization name disambiguation is a branch of entity disambiguation that focuses only on entities of the organization type. This thesis focuses on organization name disambiguation on Twitter data. Compared with traditional text, tweets lack context and contain misspellings and nonstandard syntax. Worse still, because the number of organization names is so large, it is unrealistic to manually label training data for every organization. As a result, some organization names do not appear in the labeled data; that is, the training and test sets do not overlap in the organizations they cover. To address these problems, our work is as follows:

(1) Analyse the difficulties of organization name disambiguation, and conduct data analysis guided by them. The difficulties include short context, nonstandard syntax, the unbalanced distribution of organization ambiguity, non-overlapping training and test sets, and the low coverage of existing knowledge. In addition, we summarize the existing work.

(2) Study organization name disambiguation based on a general classifier with general features. A classifier with general features is implemented and treated as the baseline system. General features are not lexical features; they measure the similarity between the information in a tweet and an organization. An organization's home page is a good source of organization-related information. However, because useful information is sometimes hard to extract from it, several kinds of data sources are introduced. Experimental results confirm the effectiveness of the different data sources and features.

(3) Study a semi-supervised optimization method to improve on general-classification-based organization name disambiguation. Constructing general features inevitably introduces noise, which leads to low precision and recall, and the general classifier is not optimized for each individual organization. We therefore take a small number of the general classifier's predictions as labeled data and treat the remaining data as unlabeled, then use a semi-supervised method to re-predict for each organization. To address the low performance of the semi-supervised method, a fusion method is proposed that combines the results of the general stage and the semi-supervised stage. Experimental results show that the optimization method improves performance to a certain extent.

(4) Study a feature-enhancing method to optimize general classification. Because the semi-supervised method does not make full use of the general classifier's predictions, for each organization we extract new lexical features, add them to the original feature space, and retrain to obtain an organization-specific adaptive classifier. A large amount of unlabeled data is introduced to mitigate the effect of data sparseness. Experimental results show that the proposed feature-enhancing method improves on the baseline system.
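The general features described in item (2) are similarity scores between a tweet and organization-related text, not the tweet's own words. The abstract does not say which similarity measure or data sources the thesis uses, so the following is only a minimal sketch that assumes bag-of-words cosine similarity. The function names, the source names (`home_page`, `encyclopedia`) and the sample texts are hypothetical illustrations, not taken from the thesis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no similarity to anything
    return dot / (norm_a * norm_b)


def general_features(tweet, org_sources):
    """Return one similarity score per organization data source.

    org_sources maps a (hypothetical) source name to the text collected
    from that source for a single organization.
    """
    tweet_vec = Counter(tokenize(tweet))
    return {
        name: cosine_similarity(tweet_vec, Counter(tokenize(text)))
        for name, text in org_sources.items()
    }


# Hypothetical organization texts, for illustration only.
sources = {
    "home_page": "Apple designs the iPhone, iPad and Mac",
    "encyclopedia": "Apple is a technology company that sells the iPhone",
}
print(general_features("New iPhone from Apple", sources))
print(general_features("apple pie recipe", sources))
```

Because the feature vector has the same shape for every organization, a classifier trained on it can in principle be applied to organizations that never appear in the training data, which is the non-overlap setting the abstract describes.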
Keywords/Search Tags: Organization name, Twitter, disambiguation, semi-supervised, feature enhancing