
Research Of Annotation Based On Topic Models And Random Walks

Posted on: 2014-01-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J S Sun
Full Text: PDF
GTID: 1228330401963104
Subject: Signal and Information Processing
Abstract/Summary:
With the development of the Internet, and especially Web 2.0, tagging technology has been widely adopted across websites. Web resources tagged with short text help people quickly access massive amounts of data. Because tagged corpora on the Internet are still scarce and manual tagging is time-consuming, high-quality automatic annotation has become a focus in both academia and industry. Content-based keyphrase extraction and tag recommendation based on existing social tag sets are the two main approaches to automatic annotation.

Although there is a large body of research and many application systems for keyphrase extraction and tag recommendation, current technology still faces several problems:
● In keyphrase extraction, existing methods consider only the local statistical properties of words or local word-to-word correlations, without considering global word-document-topic relations.
● In social tag recommendation, no topic model accounts for tag granularity and noise topics, and no prior work considers global tag-document-topic relations.
● In fusing keyphrase extraction with social tag recommendation, researchers mainly adopt linear fusion, which requires manually specified parameters and does not fully consider the relationship between keyphrases and social tags.

In view of these issues, the main research and contributions of this work are as follows.

1. Proposed a global random-walk-based keyphrase ranking algorithm, GlobalRank. The algorithm integrates local word-document statistical weights, word-word correlations, and the global word-document-topic association into a global random walk, and mines the global correlation of the candidate phrases in the given documents. The candidate phrases are then ranked by this correlation. To verify the performance of the ranking algorithm, it is applied to the keyphrase extraction task.
The experimental results show that, given the same candidate phrases, the proposed method generates more accurate keyphrases than previous algorithms that are limited to local features.

2. Proposed a series of social tag recommendation models and algorithms, progressing from simpler to more advanced. First, TG-LDA (Tag-granularity LDA) is presented, which addresses the phenomenon, common in social networks, that documents and tags are described at different granularities. Next, TN-LDA (Tag-granularity and Noise-aware LDA), an improvement on TG-LDA, is introduced; it models multiple tag granularities and noise tags at the same time. Finally, to further enhance TN-LDA, a tag recommendation algorithm is proposed that incorporates the concept of a global random walk.

Experiments show that the proposed TN-LDA model models tagged documents better and effectively improves tag recommendation performance. Tagging with Global Random Walk not only considers the interaction between the latent topics of webpages and social tags, but also combines term granularity and topic granularity, integrating them in the random walk framework to obtain a global correlation. The algorithm performs better on tag recommendation because it has the advantages of both topic models and word-based methods. Additionally, the steady state produced by the random walk can be used to refine the topic distribution of a document, so documents can be clustered on that distribution. Experiments show that its clustering performance is superior to that of topic modeling methods.

3. Proposed LabelRank, which jointly considers the scores of keyphrases and tags as well as their interrelations. The algorithm uses the idea of biased PageRank to obtain a significance score for each keyphrase and candidate tag, then sorts and outputs the candidate tags and keyphrases by that score.
In experiments on Delicious, LabelRank, which combines the advantages of keyphrases and social tags and exploits their relationship, achieved far better recommendation performance than algorithms using only keyphrase extraction or only social tag recommendation. It also outperforms the linear fusion method, which ignores that relationship. Parameter adjustment has little impact on LabelRank, which indicates strong robustness.

Finally, an automatic document tagging demo system was built on the keyphrase extraction and social tag recommendation algorithms. The system generates keyphrases, social tags, and the fused results for input documents. In addition, users can evaluate the tags generated by the system or enter tags themselves, which are stored in the dataset as training corpus.
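
The abstract states that LabelRank borrows the idea of biased PageRank to score keyphrases and candidate tags together. The following is only a minimal sketch of generic biased (personalized) PageRank by power iteration, not the dissertation's actual algorithm. The function name `biased_pagerank`, the toy graph, the node labels, the prior scores, and the damping factor of 0.85 are all assumptions made for illustration.

```python
def biased_pagerank(edges, prior, damping=0.85, iters=100, tol=1e-10):
    """Generic biased (personalized) PageRank via power iteration.

    edges: dict mapping node -> {neighbor: weight} for outgoing links
    prior: dict mapping node -> non-negative preference score (the bias)
    Illustrative sketch only; not the LabelRank algorithm itself.
    """
    nodes = sorted(
        set(prior) | set(edges) | {v for nbrs in edges.values() for v in nbrs}
    )
    total = sum(prior.get(n, 0.0) for n in nodes)
    # Normalise the prior into a probability distribution (uniform if all zero).
    p = {
        n: (prior.get(n, 0.0) / total if total > 0 else 1.0 / len(nodes))
        for n in nodes
    }
    rank = dict(p)
    for _ in range(iters):
        # Teleport step: jump back to the prior with probability 1 - damping.
        new = {n: (1.0 - damping) * p[n] for n in nodes}
        for u in nodes:
            out = edges.get(u, {})
            weight_sum = sum(out.values())
            if weight_sum == 0:
                # Dangling node: spread its mass according to the prior.
                for n in nodes:
                    new[n] += damping * rank[u] * p[n]
            else:
                # Walk step: follow outgoing links in proportion to weight.
                for v, w in out.items():
                    new[v] += damping * rank[u] * w / weight_sum
        delta = sum(abs(new[n] - rank[n]) for n in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: two keyphrases ("kp:") and two social tags ("tag:").
# Edge weights stand in for keyphrase-tag relatedness; priors stand in for the
# scores produced by separate keyphrase and tag rankers.
edges = {
    "kp:random walk": {"tag:pagerank": 1.0, "kp:topic model": 0.5},
    "kp:topic model": {"tag:lda": 1.0, "kp:random walk": 0.5},
    "tag:pagerank": {"kp:random walk": 1.0},
    "tag:lda": {"kp:topic model": 1.0},
}
prior = {
    "kp:random walk": 0.4,
    "kp:topic model": 0.3,
    "tag:pagerank": 0.2,
    "tag:lda": 0.1,
}
scores = biased_pagerank(edges, prior)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this sketch the prior plays the role of the individual keyphrase and tag scores, while the edges carry their interrelations, so the final ordering reflects both. This is one plausible reading of how a biased random walk can replace a hand-tuned linear fusion.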
Keywords/Search Tags: Keyphrases Extraction, Social Tag Recommendation, Random Walk, GlobalRank, Tag Granularity, Noise Tag, TN-LDA, Fusion of Keyphrases and Tag, LabelRank