
Cross-lingual Web Pages Automatic Classification Based On Frequently Co-occurring Entropy

Posted on: 2012-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: L Ke
GTID: 2218330338468491
Subject: Computer Science and Technology
Abstract/Summary:
Automatic Web page classification helps organize and access Internet information more effectively. Building a classification model requires a large amount of reliable labeled data. Because the number of Chinese Web pages is growing rapidly, the available labeled data is insufficient to train a good classification model, and annotating data sets is time-consuming. English labeled data, by contrast, is relatively plentiful. Although it is expressed in a different language, it shares a substantial amount of semantic information with Chinese data, so knowledge from English can be applied to Chinese classification. Research has shown that English labeled data can help build an automatic classification model for Chinese Web pages. Traditional classification models treat every feature that co-occurs in the training set and the test set as category knowledge, but not all co-occurring features actually represent category knowledge well. The key issue is therefore how to choose good co-occurring features.

Given that Chinese labeled data is scarce while English labeled data is rich, we propose an approach to cross-language automatic Web page classification based on Frequently Co-occurring Entropy (FCE). The method is intended for automatic classification of large data sets and draws on the feature information of the test set. We use FCE to extract features that better express category knowledge. First, the algorithm uses Google Translate so that the English labeled data can be applied to Chinese Web page classification. Second, we compute the FCE values over all Chinese and English Web pages and sort them in descending order. Third, we select a subset of the co-occurring features as classification knowledge. Finally, we build a Chinese classification model from the English labeled Web pages.

Contributions of this thesis:
1. We propose a classifier model based on Frequently Co-occurring Entropy and apply it to cross-language automatic Web page classification, where it yields better features.
2. We build an Adapted-based Naive Bayes classification model based on FCE, and also build Naive Bayes and support vector machine classification models based on FCE. These models are applied to cross-language automatic Web page classification. We carry out and analyze comparative experiments with a variety of classifiers, and the proposed methods perform well.
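The feature-selection steps described above (score the co-occurring features, sort in descending order, keep the top part) can be sketched roughly as follows. The abstract does not give the FCE formula, so the score used here, log(P_s * P_t / (|P_s - P_t| + beta)), is an assumption chosen only because it rewards features that are frequent in both sets and similarly distributed across them. The function names, the smoothing constants `alpha` and `beta`, and the toy token lists are all hypothetical, and the documents are assumed to be already translated into one language and tokenized.

```python
import math
from collections import Counter


def doc_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))
    return counts


def cooccurrence_scores(source_docs, target_docs, alpha=1e-4, beta=1e-4):
    """Score every feature that occurs in both the labeled source set and
    the unlabeled target set. ASSUMED score, not taken from the thesis:
    log(P_s * P_t / (|P_s - P_t| + beta)), with smoothed document frequencies.
    """
    src = doc_frequency(source_docs)
    tgt = doc_frequency(target_docs)
    scores = {}
    for w in set(src) & set(tgt):  # co-occurring features only
        p_s = (src[w] + alpha) / (len(source_docs) + 2 * alpha)
        p_t = (tgt[w] + alpha) / (len(target_docs) + 2 * alpha)
        scores[w] = math.log((p_s * p_t) / (abs(p_s - p_t) + beta))
    return scores


def select_features(scores, k):
    """Sort scores in descending order (ties broken alphabetically), keep k."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Toy data: "source" stands in for translated English labeled pages,
# "target" for the Chinese pages to be classified.
source_docs = [
    ["football", "match", "goal"],
    ["football", "league", "the"],
    ["stock", "market", "the"],
    ["stock", "price", "the"],
]
target_docs = [
    ["football", "goal", "the"],
    ["football", "score"],
    ["stock", "market"],
    ["stock", "bank"],
]

scores = cooccurrence_scores(source_docs, target_docs)
top = select_features(scores, 4)
print(top)
```

A classifier such as Naive Bayes would then be trained on the labeled pages using only the selected features. In this toy data, "the" co-occurs but is distributed very differently in the two sets, so it receives the lowest score and is dropped when the top four features are kept.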
Keywords/Search Tags: Cross-language, Web pages automatic Classification, Frequently Co-occurring Entropy (FCE), Naive Bayes, Adapted-based Naive Bayes