| With the rapid development of information technology, web page categorization has become one of the most active areas of research. Web page categorization is the process of using computers to automatically classify large numbers of web pages according to a set of categorization rules. It organizes web pages in an orderly way, improves the performance of information retrieval systems, and increases the availability of web resources. Feature selection is a key step in web page categorization and directly influences categorization performance. First, we introduce the working principles, procedure, and development of web page categorization, and then discuss each step of the workflow. We also introduce several commonly used classification algorithms and compare them. Within the categorization process, we mainly study feature selection, examining its principles and significance. Building on several popular feature selection algorithms, we study the Mutual Information (MI) and χ² statistic (CHI) algorithms in depth. We find that MI ignores features whose MI values are negative and tends to favor words with low occurrence probabilities, whereas CHI pays no attention to those words and cannot remove meaningless words with high occurrence probabilities. Moreover, both algorithms neglect the probabilities with which a feature occurs in different categories. To address these defects, we propose improvements that modify the MI and CHI expressions. After analyzing the range of possible feature selection objects, we take into account the title, main text, hyperlink content, and tag words, which may contain information useful for categorization, and propose a position-weight method that assigns different weights to features in different positions. Considering the incompleteness of stop-word lists, we use regular expressions to select nouns and verbs as the primary feature subset in order to pre-reduce the dimension of the feature vector space. We use the improved feature selection methods to classify web pages. The results show that the new approach not only clearly improves categorization accuracy, but also reduces computational cost and improves the efficiency of categorization. |
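
For orientation, the two baseline scores discussed in the abstract can be sketched as follows. This is a minimal illustration of the standard textbook forms of MI and CHI computed from a term/category contingency table, not the improved expressions proposed in this work, which the abstract does not state. The function names (`contingency`, `mutual_information`, `chi_square`) and the toy documents are assumptions introduced only for this example.

```python
import math


def contingency(docs, labels, term, category):
    """Count documents by (contains term?, belongs to category?).

    docs is a list of token sets, labels the matching list of category names.
    Returns (a, b, c, d):
      a = in category, contains term      b = not in category, contains term
      c = in category, lacks term         d = not in category, lacks term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d


def mutual_information(a, b, c, d):
    """Standard (pointwise) MI: log( P(t, c) / (P(t) * P(c)) )."""
    n = a + b + c + d
    if a == 0:
        # The term never co-occurs with the category; the log is undefined.
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/category table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or column of the table is empty; treat the term as uninformative.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical toy corpus, for illustration only.
docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"market", "match"}]
labels = ["sports", "sports", "finance", "finance"]

for term in ("goal", "match"):
    table = contingency(docs, labels, term, "sports")
    print(term, table, mutual_information(*table), chi_square(*table))
```

In this form the MI score depends only on the ratio of the term's frequency within the category to its overall frequency, so a term seen once, and only inside the category, scores as high as a term that appears in every document of that category. This is one way to see the bias toward low-probability words noted in the abstract.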