Cross-lingual Web Pages Automatic Classification Based On Frequently Co-occurring Entropy

Posted on:2012-04-30

Degree:Master

Type:Thesis

Country:China

Candidate:L Ke

GTID:2218330338468491

Subject:Computer Science and Technology

Abstract/Summary:

Automatic classification of Web pages helps to organize and access Internet information. Building a classification model requires a large amount of reliable labeled data. Because the number of Chinese Web pages is growing rapidly, the available labeled data is not enough to train a good classification model, and annotating a data set is time-consuming. English labeled data, however, is relatively plentiful. Although it is expressed in a different language, it shares a substantial amount of semantic information with Chinese data, so knowledge from English can be applied to Chinese classification. Research shows that English labeled data can help to build an automatic classification model for Chinese Web pages. Traditional classification models treat every feature that co-occurs in the training set and the test set as category knowledge, but not all co-occurring features represent category knowledge well. The key issue is therefore how to choose good co-occurring features.

Because Chinese labeled data is scarce while English labeled data is rich, we propose an approach to cross-language automatic Web page classification based on Frequently Co-occurring Entropy (FCE). The method is intended for automatic classification of large-volume data sets and draws on the feature information of the test data set. We use FCE to extract better features, that is, features that express the category knowledge well. First, the algorithm uses Google Translate so that the English labeled data can be applied to Chinese Web page classification. Second, we compute FCE values over all Chinese and English Web pages and sort them in descending order. Third, we select a subset of the co-occurring features as classification knowledge. Finally, we build a Chinese classification model from the English labeled Web pages.

Work and new points in this thesis:

1. We propose a classifier model based on Frequently Co-occurring Entropy. The method is applied to cross-language automatic Web page classification and yields better features.
2. We build an Adapted-based Naive Bayes classification model based on Frequently Co-occurring Entropy. We also build a Naive Bayes classification model and a support vector machine classification model based on Frequently Co-occurring Entropy. These models are applied to cross-language automatic Web page classification. We also carry out and analyze comparative experiments with a variety of classifiers, and the methods perform well.
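
The feature-selection steps described in the abstract (score the co-occurring features, sort the scores in descending order, keep the top part) can be sketched as follows. The abstract does not state the FCE formula, so this is only an illustration under stated assumptions: the score used here, log(P_s(w) * P_t(w) / (|P_s(w) - P_t(w)| + beta)), is an assumed form that rewards terms occurring often, and at similar rates, in both collections. The function names, the smoothing constants `alpha` and `beta`, and the representation of each page as a list of already translated and tokenized terms are all hypothetical.

```python
import math
from collections import Counter


def doc_probability(docs, alpha=1e-4):
    """Return a function giving the smoothed fraction of documents containing a term."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    n = len(docs)
    return lambda w: (df[w] + alpha) / (n + 2 * alpha)


def fce_scores(source_docs, target_docs, alpha=1e-4, beta=1e-4):
    """Score every term that occurs in both collections.

    ASSUMPTION: the scoring formula below is illustrative; the abstract
    does not define how the thesis computes FCE.
    """
    p_s = doc_probability(source_docs, alpha)  # e.g. translated English labeled pages
    p_t = doc_probability(target_docs, alpha)  # e.g. Chinese unlabeled pages
    shared = {w for d in source_docs for w in d} & {w for d in target_docs for w in d}
    return {
        w: math.log(p_s(w) * p_t(w) / (abs(p_s(w) - p_t(w)) + beta))
        for w in shared
    }


def select_features(source_docs, target_docs, k):
    """Sort shared terms by score in descending order and keep the top k."""
    scores = fce_scores(source_docs, target_docs)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The selected terms would then serve as the vocabulary for whichever classifier is trained on the labeled pages (Naive Bayes or a support vector machine, for example). How many terms to keep is a tuning choice that this sketch leaves to the caller through `k`.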

Keywords/Search Tags:

Cross-language, Web pages automatic classification, Frequently Co-occurring Entropy (FCE), Naive Bayes, Adapted-based Naive Bayes

Related items

1	Research On Spam Text Classification Based On Improved Naive Bayes Algorithm
2	Study On The Application Of Hierarchical Bayesian In Emotional Classification
3	Chinese Web Pages Based On Naive Bayesian Classification Technology Research And Application
4	The Research Of Multi-layer Hidden Naive Bayes Algorithm Based On Mutual Information
5	Research On Naive Bayes Classifiers And Its Improved Algorithms
6	Research And Application On Naive Bayes Classification Algorithm
7	Research On Text Classification Algorithm Based On Naive Bayes Method
8	Research On The Methods Of Chinese Text Classification Using Bayes And Language Model
9	A Text Classifier About High Blood Pressure Based On Naive Bayes
10	Research On Algorithms For Naive Bayes Classification And Its Tools Based On Hadoop