Research On Large-Scale Document Automatic Tagging Technologies

Posted on:2012-12-18

Degree:Doctor

Type:Dissertation

Country:China

Candidate:K Sun

GTID:1118330362450181

Subject:Computer application technology

Abstract/Summary:

As a Web 2.0 service, the social bookmarking system has accumulated a huge number of user tags by giving web users the power to label web objects. However, the low technical barrier has also brought a tag reusability problem, caused by improper tagging behavior of users in social bookmarking systems. The reusability problem has now become a crucial issue affecting the ability of social bookmarking systems to organize information and to support other uses.

As a remedy for the tag reusability problem, automatic tagging has recently drawn much attention. Based on a deep analysis of the user's profile and the given object, an automatic tagging system can offer a group of refined tags for the user to choose from. Such a mechanism can implicitly guide users to contribute more tags with higher relevance to the given object, and can become a self-feedback system for improving the reusability of tags.

In this dissertation, we focus on the large-scale document automatic tagging problem. Our research includes:

1. A statistical Language Model framework for Tag Recommendation (LMTR) is proposed, with two different kinds of LMTR models. Experiments are carried out and discussed accordingly.

2. The problem of optimizing tagging efficiency is studied in order to accelerate the tagging speed of LMTR models. By analyzing the factors that affect the tagging efficiency of LMTR models, a candidate-tag generation based framework for efficiently tagging large-scale documents is proposed. Three different kinds of algorithms for generating candidate tags are proposed, based on the vector space model (VSM), tag co-occurrence, and text extraction theories respectively. Experiments are provided to evaluate the effectiveness of these proposals.

3. The tag quality evaluation problem is studied in order to construct a high-quality tag vocabulary for the automatic tagging system. Three methods for evaluating tag quality are proposed, based on clarity measures, term frequency measures, and information gain measures. Experimental results are provided to evaluate the effectiveness of these methods. Then, a Bayesian estimator based ranking aggregation method is applied to evaluate the effect of the tag quality score on the LMTR algorithm.

4. Based on a study of users' tagging behaviors, a hypothesis named the "descriptive hypothesis" is proposed to explain users' tagging behavior, and a Minimum Description Tag-set (MDT) based automatic tagging framework is proposed accordingly. Instead of recommending single tags for a document, the MDT framework recommends a set of tags as a whole to describe the document more precisely. Furthermore, to solve the problem of finding the minimum description tag-set in a large tag vocabulary, a greedy algorithm named palette-tagging is proposed, and a language model based method for constructing the description function is proposed with two kinds of parameter estimation methods. Experimental results show that the MDT framework can explain users' tagging behavior more appropriately.
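
The abstract does not give the form of the two LMTR models, so the following is only a minimal sketch of the general idea behind language-model-based tag recommendation, not the dissertation's method. It assumes one unigram model per tag, estimated from the documents carrying that tag, with additive smoothing; a new document is then scored by its log-likelihood under each tag model. The function names (`train_tag_models`, `score_tag`, `recommend_tags`) and the smoothing parameter `alpha` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train_tag_models(tagged_docs):
    """Build one unigram count table per tag from (tokens, tags) pairs."""
    tag_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, tags in tagged_docs:
        vocabulary.update(tokens)
        for tag in tags:
            tag_counts[tag].update(tokens)
    return tag_counts, vocabulary


def score_tag(tokens, counts, vocab_size, alpha=1.0):
    """Log-likelihood of a document under a tag's smoothed unigram model.

    Additive smoothing is an assumption of this sketch; one extra slot
    in the denominator reserves probability mass for unseen tokens.
    """
    total = sum(counts.values())
    denominator = total + alpha * (vocab_size + 1)
    return sum(math.log((counts[tok] + alpha) / denominator) for tok in tokens)


def recommend_tags(tokens, tag_counts, vocabulary, top_k=3):
    """Rank all known tags by document log-likelihood and return the best."""
    vocab_size = len(vocabulary)
    scored = [
        (score_tag(tokens, counts, vocab_size), tag)
        for tag, counts in tag_counts.items()
    ]
    scored.sort(reverse=True)
    return [tag for _, tag in scored[:top_k]]
```

Scoring every tag in the vocabulary for every document, as this sketch does, grows with the vocabulary size, which is the cost that the candidate-tag generation framework in item 2 is meant to reduce.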

Keywords/Search Tags:

automatic tagging, statistical language model, large-scale document processing, tag quality evaluation, minimum description tag-set

Related items

1	Application Research On Statistical Language Model Of Large Vocabulary Continuous Speech Recognition System
2	Statistical Shape Modeling Based On Minimum Description Length Optimization And Segmenting In Medical Images
3	Research On Credibility Evaluation Of Modeling And Simulation For Large-Scale Complex System
4	Research On Statistical Machine Translation At Document Level
5	Using Statistical Language Modeling For Ad Hoc Information Retrieval
6	Research Of Statistical Language Model N-best Reranking Algorithm
7	Research On Segmentation Method Of Vertebrae Magnetic Resonance Images Based On Automated Hybrid Modeling
8	Research On Improvements Of Chinese Part-of-Speech Tagging System Based On Statistical Model
9	Research On Quality Evaluation Of English-Chinese Artificial Translation Based On Language Model
10	Research On Total Quality Management System Modeling And Its Implementation Techniques For Product Life Cycle