The Application And Research Of Text Classification

Posted on:2009-08-17

Degree:Master

Type:Thesis

Country:China

Candidate:C C Cai

GTID:2178360272457230

Subject:Computer application technology

Abstract/Summary:

The amount of information on the web has grown rapidly with the development of the Internet. To make this information easier to use, much attention has been focused on search engines and data mining, and text categorization (TC) plays an important role in both. Methods based on machine learning have achieved good results in text categorization, but problems such as nonlinearity, skewed data distribution, the labeling bottleneck, hierarchical categorization, scalability of algorithms, and categorization of Web pages remain open.

This thesis studies learning from unlabeled data and hierarchical categorization. It also discusses feature selection, feature extraction, and text categorization algorithms.

Real-world data is often incomplete, so constructing classifiers from incomplete data is an important problem. The usual approach trains on the data set with the naive Bayes classifier and the EM algorithm. However, both depend on the initial data, and the precision of the classifier suffers especially when the unlabeled (incomplete) data outnumbers the labeled data. To improve classification results, this thesis introduces a new method based on the Bernoulli mixture model and the EM algorithm.

Common web classification methods are mostly based on text classification and do not take full account of the particular characteristics of web pages, including their semi-structured features and noisy information. Furthermore, many web classification approaches assume that the testing data and the training data come from the same sample, but sometimes the testing data must be considered to come from a different sample. Here the testing data is treated as consisting of labeled data and unlabeled data. A web classification method, LUD (Learning by Unlabeled Data), is proposed to improve classification precision. The experimental results show that LUD outperforms common classification methods: it improves classification precision and provides a way to discover new categories.
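
To make the baseline mentioned in the abstract more concrete, the following is a minimal sketch of naive Bayes trained with EM on labeled plus unlabeled documents, using a multivariate Bernoulli (word present/absent) model. It is not the method proposed in the thesis, whose details are not given here; the function names, the add-one smoothing, the fixed iteration count, and the toy data are all assumptions made for illustration.

```python
import math

def m_step(docs, resp, n_classes):
    """Estimate class priors and per-class Bernoulli word probabilities
    from (possibly soft) class weights, with add-one smoothing (assumed)."""
    n_features = len(docs[0])
    totals = [sum(r[c] for r in resp) for c in range(n_classes)]
    priors = [(1.0 + totals[c]) / (n_classes + len(docs)) for c in range(n_classes)]
    theta = []
    for c in range(n_classes):
        row = []
        for j in range(n_features):
            on = sum(r[c] * x[j] for x, r in zip(docs, resp))
            row.append((1.0 + on) / (2.0 + totals[c]))
        theta.append(row)
    return priors, theta

def posterior(x, priors, theta):
    """E-step for one document: P(class | x) under the Bernoulli model."""
    logs = []
    for p, row in zip(priors, theta):
        s = math.log(p)
        for xj, t in zip(x, row):
            s += math.log(t) if xj else math.log(1.0 - t)
        logs.append(s)
    m = max(logs)  # subtract the max before exponentiating for stability
    exps = [math.exp(v - m) for v in logs]
    z = sum(exps)
    return [e / z for e in exps]

def semi_supervised_em(labeled, labels, unlabeled, n_classes, n_iter=20):
    """Initialize from labeled data only, then alternate E- and M-steps.
    Labeled documents keep their fixed one-hot weights throughout."""
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    priors, theta = m_step(labeled, hard, n_classes)
    for _ in range(n_iter):
        soft = [posterior(x, priors, theta) for x in unlabeled]
        priors, theta = m_step(labeled + unlabeled, hard + soft, n_classes)
    return priors, theta

def predict(x, priors, theta):
    """Return the index of the most probable class for document x."""
    post = posterior(x, priors, theta)
    return max(range(len(post)), key=post.__getitem__)

# Toy data (hypothetical): 4 binary word features, 2 classes.
labeled = [[1, 1, 0, 0], [0, 0, 1, 1]]
labels = [0, 1]
unlabeled = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 0, 0],
             [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1]]
priors, theta = semi_supervised_em(labeled, labels, unlabeled, n_classes=2)
```

Because the initial model is estimated from the labeled documents alone, a small or unrepresentative labeled set steers every later iteration, which is the dependence on initial data that the abstract points to when unlabeled data outnumbers labeled data.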

Keywords/Search Tags:

Text categorization, Web classification, Naive bayes, EM algorithm, Incomplete data

Related items

1	Text Categorization Based On Naive Bayes Method
2	The Study Of Naive Bayes Text Classification System Based On Artificial Intelligence
3	The Study Of Chinese Text Categorization Based On Naive Bayes
4	Correlation Between The Text Classification. Word
5	Data Mining Systems And Their Applications - Improve The Performance Of The Naive Bayes Text Classifier, Associated Characteristics
6	Research On Web Text Classification Algorithm Based On Parallelism
7	Design And Implementation Of Text Classification System Based On K-neighborhood And Naive Bayesian
8	Research On Text Classification Algorithm Based On Naive Bayes Method
9	Research And Improvement To Text Classification Algorithm
10	Chinese Text Data Classification