Research On Text Classification Based On A Keyword

Posted on:2011-05-19

Degree:Master

Type:Thesis

Country:China

Candidate:Q Qiu

Full Text:PDF

GTID:2178360305974310

Subject:Computer software and theory

Abstract/Summary:

With the rapid growth of the World Wide Web, the volume of electronic text is increasing rapidly, and text classification has become one of the key techniques for organizing online information. Machine learning approaches are popular for text classification. However, traditional supervised learning techniques typically require a large number of labeled examples to learn an accurate classifier, and labeling enough training examples is an expensive and tedious process. On the other hand, large numbers of unlabeled documents can be obtained easily from the Web. Overcoming the labeling bottleneck is therefore currently a hot research topic in text classification.

In this paper, we study the problem of building a text classifier from a keyword and unlabeled documents, so as to avoid labeling documents manually. First, with the help of WordNet, the keyword is expanded into a set of query terms, which are used to retrieve a set of documents from the unlabeled collection. Second, a set of positive documents is mined from the retrieved documents. Third, with the help of these positive documents, more positive documents are extracted from the unlabeled documents. Finally, a text classifier is trained on the positive documents and the unlabeled documents. The experimental results show that the proposed approach performs much better than a PU (positive and unlabeled) learner trained on partially labeled positive documents.

Because Wikipedia contains plentiful knowledge, the second approach proposed in this paper builds a text classifier from a keyword and Wikipedia knowledge, again avoiding manual labeling. First, a set of documents related to the keyword is retrieved from Wikipedia. Then, with the help of the related Wikipedia pages, more positive documents are extracted from the unlabeled documents. Finally, a text classifier is trained on these positive documents and the unlabeled documents. The experimental results show that this approach is very competitive with NB-SVM, a PU learner, and NB, a supervised learner.

In many real-life text classification applications, no labeled documents are available. The two approaches proposed in this paper build a text classifier from a keyword; they do not need any labeled documents and are therefore better suited to real-world text classification applications. They could also make text classifiers easier to use.
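The keyword-driven pipeline described above can be illustrated with a small Python sketch. This is not the thesis's implementation: the `SYNONYMS` table is a hypothetical hand-written stand-in for a WordNet lookup, the function names and toy documents are invented for illustration, and the intermediate steps (mining reliable positives and extracting further positives) are skipped. Retrieved documents are simply treated as positive, the remaining unlabeled documents as negative, and a naive Bayes classifier is trained on the two sets.

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet synonym lookup (assumption, not thesis data).
SYNONYMS = {"car": {"car", "automobile", "auto"}}

def tokenize(text):
    return text.lower().split()

def expand_keyword(keyword):
    """Expand a keyword into a set of query terms."""
    return SYNONYMS.get(keyword, {keyword})

def retrieve(docs, query_terms):
    """Split docs into those containing at least one query term and the rest."""
    hits, rest = [], []
    for doc in docs:
        (hits if query_terms & set(tokenize(doc)) else rest).append(doc)
    return hits, rest

def train_nb(pos_docs, unlabeled_docs):
    """Multinomial naive Bayes with Laplace smoothing.

    Unlabeled documents are treated as negatives (a simplifying assumption).
    Returns a function giving the log-odds that a document is positive.
    """
    counts = {1: Counter(), 0: Counter()}
    for label, docs in ((1, pos_docs), (0, unlabeled_docs)):
        for doc in docs:
            counts[label].update(tokenize(doc))
    vocab = set(counts[1]) | set(counts[0])
    totals = {c: sum(counts[c].values()) for c in counts}
    n = len(pos_docs) + len(unlabeled_docs)
    # Smoothed class priors so that an empty class does not produce log(0).
    prior = {1: (len(pos_docs) + 1) / (n + 2),
             0: (len(unlabeled_docs) + 1) / (n + 2)}

    def log_odds(doc):
        score = math.log(prior[1]) - math.log(prior[0])
        for word in tokenize(doc):
            if word in vocab:  # ignore words never seen in training
                p1 = (counts[1][word] + 1) / (totals[1] + len(vocab))
                p0 = (counts[0][word] + 1) / (totals[0] + len(vocab))
                score += math.log(p1) - math.log(p0)
        return score

    return log_odds

# Toy unlabeled collection (invented for illustration).
unlabeled = [
    "the new car has a powerful engine",
    "an automobile dealer sold the engine parts",
    "the recipe needs flour and sugar",
    "bake the cake with sugar and butter",
    "the engine of the truck needs repair",
]

terms = expand_keyword("car")
positives, remaining = retrieve(unlabeled, terms)
classifier = train_nb(positives, remaining)

for text in ("an automobile with a powerful engine", "bake the cake with sugar"):
    label = "positive" if classifier(text) > 0 else "negative"
    print(f"{text!r}: {label}")
```

Treating every unlabeled document as a negative is the crudest possible PU baseline; the approach summarized in the abstract instead refines the positive set in several steps before training the final classifier.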

Keywords/Search Tags:

text classification, keyword, WordNet, Wikipedia, unlabeled documents

Related items

1	Automatic Classification Of Various Types Of Documents Based On Wikipedia
2	Research On Text Stream Classification By Keywords
3	Text Classification Method Based On WordNet
4	Text Classification Based On Wikipedia Knowledge
5	A Study On Improving The Methods Of WEB Query Classification
6	Title Classification Research Of Collected Documents Based On Subject Matching
7	Research On The Short Text Stream Classification Based On The Corpuse Extension From Wikipedia
8	Research On Keyword Extraction Technology Oriented To Conversational Text
9	Research On Building Wikipedia Semantic Knowledge Base And Its Application In Text Classification
10	Study On The Text Representation Of Extraction-based Multi-documents Summarization