Research On Large Scale Hierarchical Classification For Internet Text

Posted on: 2015-03-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L He
Full Text: PDF
GTID: 1108330509961070
Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology and the rapid growth of Internet data, managing and accessing web information has become increasingly difficult. To organize and manage the massive volume of web information, a large-scale class hierarchy of concepts or topics is used to label it and make access easier. The hierarchy usually satisfies a partial order relation, typically taking the form of a tree or a directed acyclic graph, and it can reach thousands or even tens of thousands of categories. The large-scale hierarchical classification problem, which is examined in this dissertation, concerns how to classify web documents into the categories of such a hierarchy. Besides building network resource directories and a harmonious Internet environment, large-scale hierarchical classification can also be applied to information retrieval, network resource management, green Internet, network reputation management, hazardous information filtering, and so on.

Unlike traditional text categorization, large-scale hierarchical classification has its own characteristics, such as a large-scale class hierarchy, insufficient training data, and classification objects that are evolving from web text to social text. This dissertation focuses on four of these features: the large-scale class hierarchy, the rarity of categories, the lack of labeled corpora, and social text as the object of classification, and conducts corresponding research on each. The content and contributions are as follows.

1) The large-scale hierarchical classification problem is surveyed. First, a definition of the problem is proposed to describe it at an abstract level, and strategies for tackling it are investigated.
Second, a taxonomy of solution methods is analyzed, and on that basis many typical methods are introduced and compared. Finally, the characteristics and applications of the different methods are reviewed.

2) To address the large scale of the class hierarchy, we study a two-stage classification algorithm based on candidate category search. Because the class hierarchy is very large, classification performance remains low. Although a reduce-and-conquer strategy has been proposed to make the problem tractable, candidate search is the bottleneck of the classification. In this work, we first analyze the computational complexity of the category candidate search problem and prove that it is NP-hard. We then propose a candidate search algorithm that adopts a greedy strategy, and we prove that this strategy is a locally optimal choice in the heuristic solving process. In the classification stage, we find that ancestor categories can help classify the candidates. Experiments are conducted on a dataset of web pages from the Chinese Simplified branch of the DMOZ directory. The results show that the proposed algorithm improves candidate search performance over existing methods and further improves the classification accuracy of two-stage approaches.

3) To address the widespread rarity of categories in large-scale hierarchical classification, we study a hierarchical classification method based on feature extraction via LDA. The category distribution over documents is skewed, that is, most categories have very few labeled documents, and the resulting data sparseness in rare categories leads to low classification performance. In this work, we study web-page classification over the topic taxonomy of the DMOZ directory. For this hard task, we propose a hierarchical classification model based on Latent Dirichlet Allocation.
We use the LDA model as a feature extraction technique to extract latent topics, which reduces the effects of data sparseness, and construct topic feature vectors associated with the corpus to train more robust classification models for rare categories. Experiments were conducted on a dataset of web pages from the Chinese Simplified branch of the DMOZ directory. The results show that our method improves performance on rare categories over hierarchical classification methods based on full terms and on feature words, and further improves performance over the whole topic taxonomy.

4) To address the lack of labeled corpora in large-scale hierarchical classification, we study a method for training classification models with unlabeled web data. Traditional text classification methods require a labeled corpus to train classifiers, but labeled training samples are very difficult to obtain. In this work, we propose a hierarchical text classification method that does not require any labeled data. The method takes advantage of the ontological knowledge of the topic hierarchy when retrieving training documents from the Web, and uses a hierarchical support vector machine to train classifiers on the resulting Web corpora. To reduce the negative effect of noisy data, we construct the web query from the label path, search for related documents on both Google and Wikipedia, and group the corpus according to the hierarchy. The experimental results show that this method can train classifiers from unlabeled web data and achieves classification results that are competitive with supervised classification using labeled training samples.

5) To address social text in large-scale hierarchical classification, we study topic modeling for microblogs. In this work, we propose a user topic mining (UTM) model. For each user, interests are divided into two parts according to how the microblogs are generated: original interest and retweet interest.
We present a Gibbs sampling implementation for inferring the parameters of the UTM model, which discovers not only a user's original interest but also the user's retweet interest. We then combine original interest and retweet interest to compute interest words for users. Experiments on a dataset of Sina microblogs demonstrate that UTM discovers user interests effectively and outperforms existing topic models on this task. We also find that the interest words discovered by UTM reflect user labels and cover a much broader range. However, the interest words are too fine-grained to achieve microblog classification. To address this shortcoming, we propose a supervised generative model, uLTM, to infer category labels for microblogging users. The uLTM model introduces label mixture proportions for each user and makes use of the unsupervised learning machinery of topic models to discover the latent topics in microblogs. By using the label mixture proportions, we obtain a supervised generative model with predictive power for the classification task, so that user labels are generated automatically by analyzing microblog content. We compare the predictive power of uLTM with existing methods on a Twitter dataset and demonstrate that our method is competitive with the baselines on this task.

In summary, this dissertation targets the four key characteristics of large-scale hierarchical classification. Several key techniques are studied, including candidate category search, rare category classification, learning from unlabeled data, and social text modeling. These techniques are interesting and useful, and they have promising prospects for the classification and topic mining of Internet text.
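
The label-path query construction mentioned under contribution 4 can be illustrated with a minimal sketch. The abstract does not give the query format, so everything below is an assumption made for illustration: the toy hierarchy `PARENT`, the helper names `label_path` and `build_query`, the choice to quote each label, and the choice to drop the root label.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy (not from the dissertation): child -> parent.
# The root has parent None.
PARENT: Dict[str, Optional[str]] = {
    "Top": None,
    "Science": "Top",
    "Biology": "Science",
    "Genetics": "Biology",
    "Arts": "Top",
    "Music": "Arts",
}


def label_path(category: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the labels from the root down to `category`."""
    path: List[str] = []
    node: Optional[str] = category
    while node is not None:
        path.append(node)
        node = parent[node]  # walk one level up the hierarchy
    path.reverse()  # collected leaf-to-root, so flip to root-to-leaf
    return path


def build_query(
    category: str,
    parent: Dict[str, Optional[str]],
    skip_root: bool = True,
) -> str:
    """Join the label path into one search query string.

    Assumption: each label is quoted as a phrase, and the generic root
    label is dropped unless the category is the root itself.
    """
    path = label_path(category, parent)
    if skip_root and len(path) > 1:
        path = path[1:]
    return " ".join(f'"{label}"' for label in path)


if __name__ == "__main__":
    print(build_query("Genetics", PARENT))
```

Including ancestor labels in the query is one plausible way to narrow an ambiguous leaf label to its intended sense, which fits the stated goal of reducing noisy data. How the dissertation actually weights or formats the path terms is not stated in the abstract.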
Keywords/Search Tags: text categorization, large scale hierarchical classification, topic hierarchy, rare category, social text, topic modeling