Research On Text Classification Based On A Keyword

Posted on: 2011-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: Q Qiu
Full Text: PDF
GTID: 2178360305974310
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of the World Wide Web, the volume of electronic text is increasing rapidly. Text classification has become one of the key techniques for organizing online information, and machine-learning methods are widely used for it. However, traditional supervised learning techniques typically require a large number of labeled examples to learn an accurate classifier, and labeling enough training examples is an expensive and tedious process. Unlabeled documents, on the other hand, are easy to obtain in large quantities from the Web. Overcoming the labeling bottleneck is therefore currently an active research area in text classification.

In this paper, we study the problem of building a text classifier from a keyword and unlabeled documents, so as to avoid labeling documents manually. First, with the help of WordNet, the keyword is expanded into a set of query terms, which are used to retrieve a set of documents from the unlabeled collection. Second, a set of positive documents is mined from the retrieved documents. Third, with the help of these positive documents, more positive documents are extracted from the unlabeled documents. Finally, a text classifier is trained with these positive documents and the unlabeled documents. The experimental results show that the proposed approach performs much better than a PU learner trained on a partially labeled set of positive documents.

Because Wikipedia contains a wealth of knowledge, the second approach proposed in this paper builds a text classifier from a keyword and Wikipedia knowledge, again avoiding manual labeling. First, a set of documents related to the keyword is retrieved from Wikipedia. Then, with the help of these related Wikipedia pages, more positive documents are extracted from the unlabeled documents. Finally, a text classifier is trained with these positive documents and the unlabeled documents. The experimental results show that this approach performs very competitively compared with NB-SVM, a PU learner, and NB, a supervised learner.

In many real-life text classification applications, no labeled documents are available. The two approaches proposed in this paper build a text classifier based on a keyword. They do not need any labeled documents and are therefore more suitable for real-world text classification applications. They could also make text classifiers easier to use.
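
The first approach described above (expand the keyword, retrieve documents, then extract further positive documents from the unlabeled set) can be illustrated with a minimal sketch. This is not the thesis's implementation: the `SYNONYMS` table is a hypothetical stand-in for a WordNet lookup, the centroid-similarity step and its `threshold` value are assumptions chosen only for illustration, and the final PU-learning classifier is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a WordNet synonym lookup (assumption, not from the thesis).
SYNONYMS = {"car": {"automobile", "auto"}}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def expand_keyword(keyword):
    """Step 1: expand the keyword into a set of query terms."""
    return {keyword} | SYNONYMS.get(keyword, set())


def retrieve(query_terms, docs):
    """Step 2: return indices of documents containing at least one query term."""
    return [i for i, d in enumerate(docs) if query_terms & set(tokenize(d))]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_more_positives(seed_ids, docs, threshold=0.3):
    """Step 3 (assumed heuristic): add unlabeled documents whose similarity
    to the centroid of the seed documents reaches the threshold."""
    centroid = Counter()
    for i in seed_ids:
        centroid.update(tokenize(docs[i]))
    return [
        i
        for i, d in enumerate(docs)
        if i not in seed_ids and cosine(Counter(tokenize(d)), centroid) >= threshold
    ]


# Toy unlabeled collection (illustrative data only).
docs = [
    "The car has a new engine.",
    "An automobile dealer sells each vehicle with an engine warranty.",
    "The engine of this vehicle is new.",
    "Fresh bread and cheese recipes.",
]

query_terms = expand_keyword("car")
seeds = retrieve(query_terms, docs)
extra = extract_more_positives(seeds, docs)
positives = sorted(set(seeds) | set(extra))
unlabeled = [i for i in range(len(docs)) if i not in positives]
```

In this sketch, `positives` and `unlabeled` would be the inputs to the final training step; a real system would also need stop-word filtering and a proper term-weighting scheme.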
Keywords/Search Tags: text classification, keyword, WordNet, Wikipedia, unlabeled documents