
The Application And Research Of Text Classification

Posted on: 2009-08-17
Degree: Master
Type: Thesis
Country: China
Candidate: C C Cai
Full Text: PDF
GTID: 2178360272457230
Subject: Computer application technology
Abstract/Summary:
The amount of information on the web has grown rapidly with the development of the Internet. To make this information easier to use, much attention has focused on search engines and data mining, and text categorization (TC) plays an important role in both. Methods based on machine learning have achieved good results in text categorization, but several problems remain open: nonlinearity, skewed data distribution, the labeling bottleneck, hierarchical categorization, the scalability of algorithms, and the categorization of Web pages.

This thesis studies the use of unlabeled data and hierarchical categorization. It also discusses feature selection, feature extraction, and text categorization algorithms.

Real-world data is often incomplete, so constructing classifiers from incomplete data is an important problem. The usual approach trains on the data set with the naive Bayes classifier and the EM algorithm. However, both depend on the initial data, and the precision of the classifier suffers especially when the unlabeled (incomplete) data outnumber the labeled data. To improve the classification results, this thesis introduces a new method based on the Bernoulli mixture model and the EM algorithm.

Common web classification methods are mostly based on text classification and do not take full account of the particular characteristics of web classification, including semi-structured features and noisy information. Furthermore, many web classification approaches assume that the testing data and the training data come from the same data sample, but sometimes the testing data must be considered to come from a different sample. Here the testing data is considered to consist of labeled data and unlabeled data. A web classification method, LUD (Learning by Unlabeled Data), is proposed to improve classification precision.

The experimental results show that the LUD method outperforms common classification methods: it improves classification precision and provides a way to discover new categories.
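
The abstract does not give the details of the proposed Bernoulli mixture model method or of LUD. As an illustration only, the general idea it starts from (a naive Bayes classifier trained with EM on labeled plus unlabeled documents) can be sketched as below. This is a minimal sketch under stated assumptions, not the thesis's algorithm: documents are assumed to be binary word-presence vectors, and the function names, the smoothing constant `alpha`, and the toy data are all hypothetical.

```python
import math

def m_step(X, R, alpha=1.0):
    """Estimate class priors and per-class word-presence probabilities
    from binary documents X and soft class memberships R (smoothed by alpha)."""
    n_classes, n_words = len(R[0]), len(X[0])
    weight = [sum(r[c] for r in R) for c in range(n_classes)]
    priors = [(weight[c] + alpha) / (len(X) + alpha * n_classes)
              for c in range(n_classes)]
    theta = [[(sum(r[c] * x[j] for x, r in zip(X, R)) + alpha)
              / (weight[c] + 2 * alpha)
              for j in range(n_words)]
             for c in range(n_classes)]
    return priors, theta

def e_step(X, priors, theta):
    """Compute P(class | document) for each document under the Bernoulli model."""
    R = []
    for x in X:
        log_p = []
        for c, prior in enumerate(priors):
            s = math.log(prior)
            for j, present in enumerate(x):
                s += math.log(theta[c][j] if present else 1.0 - theta[c][j])
            log_p.append(s)
        top = max(log_p)  # subtract the max for numerical stability
        w = [math.exp(v - top) for v in log_p]
        total = sum(w)
        R.append([v / total for v in w])
    return R

def fit_semi_supervised(X_lab, y_lab, X_unl, n_classes, n_iter=20):
    """Initialise from labeled data, then alternate E and M steps.
    Labeled documents keep their hard labels throughout."""
    R_lab = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in y_lab]
    priors, theta = m_step(X_lab, R_lab)
    for _ in range(n_iter):
        R_unl = e_step(X_unl, priors, theta)
        priors, theta = m_step(X_lab + X_unl, R_lab + R_unl)
    return priors, theta

def predict(X, priors, theta):
    """Return the most probable class index for each document."""
    return [max(range(len(priors)), key=lambda c: r[c])
            for r in e_step(X, priors, theta)]

# Hypothetical toy data: 4-word vocabulary, 2 labeled and 4 unlabeled documents.
X_lab = [[1, 1, 0, 0], [0, 0, 1, 1]]
y_lab = [0, 1]
X_unl = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
priors, theta = fit_semi_supervised(X_lab, y_lab, X_unl, n_classes=2)
print(predict([[1, 1, 0, 0], [0, 0, 1, 1]], priors, theta))
```

Because the first M step uses only the labeled documents, the result depends on that initial labeled set, which is the sensitivity the abstract points out when unlabeled data outnumber labeled data.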
Keywords/Search Tags: Text categorization, Web classification, Naive Bayes, EM algorithm, Incomplete data