Font Size: a A A

Semi-supervised document clustering with active learning

Posted on:2009-05-02Degree:Ph.DType:Thesis
University:The Chinese University of Hong Kong (Hong Kong)Candidate:Huang, RuizhangFull Text:PDF
GTID:2448390005460861Subject:Computer Science
Abstract/Summary:
This thesis presents a new framework for automatically partitioning text documents taking into consideration of constraints given by users. Semi-supervised document clustering is developed based on pairwise constraints. Different from traditional semi-supervised document clustering approaches which assume pairwise constraints to be prepared by user beforehand, we develop a novel framework for automatically discovering pairwise constraints revealing the user grouping preference. Active learning approach for choosing informative document pairs is designed by measuring the amount of information that can be obtained by revealing judgments of document pairs. For this purpose, three models, namely, uncertainty model, generation error model, and term-to-term relationship model, are designed for measuring the informativeness of document pairs from different perspectives. Dependent active learning approach is developed by extending the active learning approach to avoid redundant document pair selection. Two models are investigated for estimating the likelihood that a document pair is redundant to previously selected document pairs, namely, KL divergence model and symmetric model.;Most existing semi-supervised document clustering approaches are model-based clustering and can be treated as parametric model taking an assumption that the underlying clusters follow a certain pre-defined distribution. In our semi-supervised document clustering, each cluster is represented by a non-parametric probability distribution. Two approaches are designed for incorporating pairwise constraints in the document clustering approach. The first approach, term-to-term relationship approach (TR), uses pairwise constraints for capturing term-to-term dependence relationships. The second approach, linear combination approach (LC), combines the clustering objective function with the user-provided constraints linearly. Extensive experimental results show that our proposed framework is effective.
Keywords/Search Tags:Document, Constraints, Active learning, Framework
Related items