
A Research On Large Scale Automatic Chinese Webpages Classification

Posted on: 2007-01-10 | Degree: Master | Type: Thesis
Country: China | Candidate: H Ren | Full Text: PDF
GTID: 2178360182488955 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology, the world has moved from a shortage of information to an extreme abundance of it. People now have access to more and more digital information, including text, data, images, audio, and video. But it is quite difficult to find the information we need, because most of it is semi-structured or unstructured. For this reason, automatic webpage classification has been proposed and studied in applied settings. Studying automatic webpage classification is of great significance because it reduces the time needed to sort out online documents. It also facilitates information retrieval and online file management.

The research in this thesis includes:

1. The thesis begins by analyzing the structure of webpages based on their features and proposes a method named Webpage Noise Filtering (WNF). WNF can eliminate HTML tags, copyright information, and most advertisements from a webpage. Experiments in this thesis show that it works well. For webpage theme extraction, the thesis uses DOM tree parsing and proposes a bi-gram matching method. Experiments show that the extraction methods effectively eliminate content irrelevant to a webpage's theme while preserving the theme and related information.

2. After analyzing the shortcomings of the classical TF/IDF formula, the thesis adopts a Term-Category Weighting (TCW) method. TCW considers three factors: the importance of a feature within one category, the average distribution of the feature, and the importance of the feature across the whole set. TCW increases the weight of valuable features and improves the classifier's discriminative ability.

3. After analyzing the shortcomings of the cosine coefficient, the thesis uses the Jaccard coefficient in its place. The Jaccard coefficient expresses the degree of overlap between the document and the category. Experiments show that it helps the classifier adapt to webpage classification.

In the open test, the average precision reaches 83% after large-scale text training. This shows that the classifier has high precision, low computational cost, and high speed, and meets the requirements of large-scale Chinese webpage classification. The research can be applied to information retrieval, information filtering, text classification, webpage classification, etc.
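The difference between the two similarity measures named in item 3 can be illustrated with a minimal sketch. The abstract does not state which form of the Jaccard coefficient the thesis uses, so the extended (Tanimoto) form for weighted term vectors is assumed here; the function names, the dictionary representation, and the sample weights are all hypothetical.

```python
import math

def dot(a: dict, b: dict) -> float:
    """Dot product of two sparse term-weight vectors stored as dicts."""
    return sum(w * b[t] for t, w in a.items() if t in b)

def cosine(a: dict, b: dict) -> float:
    """Cosine coefficient: depends only on the angle between the vectors."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def jaccard(a: dict, b: dict) -> float:
    """Extended Jaccard (Tanimoto) coefficient, assumed form:
    shared weight divided by total weight minus shared weight."""
    shared = dot(a, b)
    denom = dot(a, a) + dot(b, b) - shared
    return shared / denom if denom else 0.0

# Hypothetical weights: the document uses the same terms as the category,
# in the same proportions, but with half the weight.
document = {"football": 2.0, "match": 1.0}
category = {"football": 4.0, "match": 2.0}

cos_score = cosine(document, category)
jac_score = jaccard(document, category)
```

In this sketch the cosine coefficient treats the two vectors as identical because they point in the same direction, while the assumed Jaccard form scores them lower because the weights overlap only partially. Whether this is the insufficiency the thesis has in mind is not stated in the abstract.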
Keywords/Search Tags:automatic webpage classification, webpage theme extraction, auto text classification