Research On Large-Scale Document Automatic Tagging Technologies

Posted on: 2012-12-18; Degree: Doctor; Type: Dissertation
Country: China; Candidate: K Sun; Full Text: PDF
GTID: 1118330362450181; Subject: Computer application technology
Abstract/Summary:
As a Web 2.0 service, social bookmarking systems have accumulated a huge number of user tags by giving web users the power to label web objects. However, the low technical barrier has also created a tag reusability problem, caused by the improper tagging behavior of users in these systems. The reusability problem has now become a crucial issue affecting the ability of social bookmarking systems to organize information and to support other uses.

As a remedy for the tag reusability problem, automatic tagging has recently drawn much attention. Based on a deep analysis of the user's profile and the given object, an automatic tagging system can offer a group of refined tags for the user to choose from. Such a mechanism can implicitly guide users to contribute more tags with higher relevance to the given object, and can act as a self-feedback system for improving the reusability of tags.

This dissertation addresses the problem of automatically tagging large-scale document collections. The research includes:

1. A statistical Language Model framework for Tag Recommendation (LMTR) is proposed, with two different kinds of LMTR models. Experiments are conducted and discussed accordingly.
2. The problem of optimizing tagging efficiency is studied in order to accelerate the tagging speed of LMTR models. By analyzing the factors that affect the tagging efficiency of LMTR models, a framework based on candidate-tag generation is proposed for efficiently tagging large-scale documents. Three algorithms for generating candidate tags are proposed, based respectively on the vector space model (VSM), tag co-occurrence, and text extraction. Experiments are provided to evaluate the effectiveness of these proposals.
3. The tag quality evaluation problem is studied in order to construct a high-quality tag vocabulary for the automatic tagging system. Three methods for evaluating tag quality are proposed, based on a clarity measure, term frequency measures, and information gain measures. Experimental results are provided to evaluate the effectiveness of these methods. Then, a Bayesian estimator based ranking aggregation method is applied to evaluate the effect of the tag quality score on the LMTR algorithm.
4. Based on a study of users' tagging behavior, a hypothesis named the "descriptive hypothesis" is proposed to explain that behavior, and a Minimum Description Tag-set (MDT) based automatic tagging framework is proposed accordingly. Instead of recommending single tags for a document, the MDT framework recommends a set of tags as a whole to describe the document more precisely. Furthermore, to solve the problem of finding the minimum description tag-set in a large tag vocabulary, a greedy algorithm named palette-tagging is proposed, along with a language model based method for constructing the description function, with two kinds of parameter estimation methods. Experimental results show that the MDT framework explains users' tagging behavior more appropriately.
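The abstract does not specify the form of the LMTR models, so the following is only a minimal sketch of one common way to recommend tags with statistical language models. It assumes a unigram model per tag, a tag prior taken from tag frequencies, and Jelinek-Mercer smoothing against a collection model. The function names `train_tag_models` and `recommend_tags` and the parameter `lam` are hypothetical and are not taken from the dissertation.

```python
import math
from collections import Counter, defaultdict


def train_tag_models(tagged_docs):
    """Collect term counts per tag from (tokens, tags) pairs.

    Returns (per-tag term counts, collection term counts, tag frequencies).
    This is an illustrative assumption, not the dissertation's estimator.
    """
    tag_counts = defaultdict(Counter)
    collection = Counter()
    tag_freq = Counter()
    for tokens, tags in tagged_docs:
        collection.update(tokens)
        for tag in tags:
            tag_counts[tag].update(tokens)
            tag_freq[tag] += 1
    return tag_counts, collection, tag_freq


def recommend_tags(tokens, model, top_k=3, lam=0.5):
    """Rank tags by log P(tag) + sum over words of log P(word | tag).

    P(word | tag) is a Jelinek-Mercer mixture of the tag model and an
    add-one smoothed collection model, so unseen words never give log(0).
    """
    tag_counts, collection, tag_freq = model
    coll_total = sum(collection.values())
    vocab_size = len(collection)
    total_assignments = sum(tag_freq.values())
    scores = {}
    for tag, counts in tag_counts.items():
        tag_total = sum(counts.values())
        if tag_total == 0:
            continue  # a tag with no observed text cannot be scored
        score = math.log(tag_freq[tag] / total_assignments)
        for word in tokens:
            p_tag = counts[word] / tag_total
            p_coll = (collection[word] + 1) / (coll_total + vocab_size + 1)
            score += math.log(lam * p_tag + (1 - lam) * p_coll)
        scores[tag] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data, invented purely for illustration.
docs = [
    (["python", "code", "function"], ["programming"]),
    (["python", "library", "code"], ["programming"]),
    (["recipe", "oven", "bake"], ["cooking"]),
    (["bake", "bread", "flour"], ["cooking"]),
]
model = train_tag_models(docs)
print(recommend_tags(["python", "code"], model, top_k=1))
```

In this sketch every tag in the vocabulary is scored for every document, which is the cost that a candidate-tag generation step, as described in item 2, would aim to reduce by scoring only a short list of plausible tags.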
Keywords/Search Tags: automatic tagging, statistical language model, large-scale document processing, tag quality evaluation, minimum description tag-set