To extract semantic structure from a text collection, a variety of unsupervised approaches have been proposed. Under the common "bag of words" assumption, documents are represented as vectors of term counts. On top of this representation, topic models have built a sophisticated statistical framework, following a line of work that has continuously refined the model structure. Statistical topic models are attractive because they allow rapid analysis and understanding of new collections of text. However, this framework alone cannot provide sufficient information for the problem of learning a topic hierarchy from data. It has recently been shown that data-driven learning approaches, combined with structural and prior knowledge, can offer a satisfactory solution.

In this paper, we review a new probabilistic framework that incorporates hierarchical document-frequency information into topics in order to uncover richer semantic structure. The hierarchical topics created by the DF topic model exhibit natural relationships beyond a tree structure. We illustrate our approach on 20 Newsgroups to show the performance of our model in extracting a hierarchy of topics.

From a cognitive science perspective, background knowledge is an important supplementary means of obtaining hierarchical topics, and much previous work has added side information when analyzing text data. We follow this idea in a different way: document frequency is derived from the data itself, so our method remains unsupervised. Finally, by combining document frequency with the statistical learning process, we aim to make this human-interpretable decomposition of the texts more semantic.
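The two ingredients named above, the bag-of-words representation and the document frequency (DF) statistic, can be sketched in a few lines of Python. This is a minimal illustration on a toy corpus, not the authors' implementation; all names here are our own.

```python
from collections import Counter

# Toy corpus; the paper's experiments use 20 Newsgroups instead.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Bag-of-words: each document becomes a term -> count mapping,
# i.e. a sparse count vector over the vocabulary.
bow = [Counter(doc.split()) for doc in docs]

# Document frequency: for each term, the number of documents
# that contain it at least once.
df = Counter()
for counts in bow:
    df.update(counts.keys())

print(bow[0]["the"])  # term count of "the" in the first document
print(df["sat"])      # number of documents containing "sat"
```

Because DF is computed directly from the corpus, using it as side information keeps the overall approach unsupervised, as the text argues.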