Block Based Feature Selection And Large Scale Web Page Classification

Posted on: 2008-12-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Ma
Full Text: PDF
GTID: 2178360212996076
Subject: Computer application technology
Abstract/Summary:
The Internet is developing at an incredible speed. One of the most obvious features of the information age is the rapid acceleration of information generation and communication. The "information explosion" now appears all over the World Wide Web, and it is driven by an abundance of information combined with a lack of knowledge. Web data mining is a technique that applies data mining algorithms to discover and extract new information from web pages. Web pages have two evident traits: they are highly stochastic and well structured. On the one hand, it is nearly impossible to summarize the huge number of web pages with traditional indexing methods. On the other hand, web pages are transmitted over the Internet according to the Hypertext Transfer Protocol and, because they are written in HTML, they are well structured syntactically. Unfortunately, traditional text classification based on pure content does not treat this structure as a feature. Block based feature extraction for web page classification aims to take full advantage of the structure of a web page and to classify large numbers of web pages automatically. The main application scenarios for automatic classification in web data mining are search engines, dynamically embedded web page advertising, and detection of sensitive web pages. In this thesis I discuss three aspects of the theory and implementation of large scale web page classification: web page block detection, block based feature extraction, and large scale classification by a hierarchical Support Vector Machine.

Traditional web page classification tends to filter out the HTML tags of a web page and then directly extract pure content features with text classification methods. However, a web page contains many kinds of information, including navigation bars, advertisements, relevant hyperlinks and so on.
Different kinds of information contribute differently to explaining the actual main content of the web page. Filtering out the HTML tags destroys the structural information of a web page. Conversely, irrelevant noise may be retained, and it harms the classifier. In this thesis, block based feature selection is introduced to divide the web page into blocks according to their visual priority. This method captures the structural features and filters out the noise of the web page.

The Visual Priority Document Tree (VPDom Tree) is introduced to improve on the HTML Document Tree (Dom Tree). A Dom Tree can divide a web page into blocks by its HTML tags, but different blocks have different importance to the person browsing the page because of their different positions. I divide the blocks into four levels according to their priority, from low to high: Noisy Block, Relevant Block, Main Block Dividable and Main Block Undividable. A VPDom Tree is a Dom Tree annotated with visual priority. Only Main Block Dividable blocks are divided into sub-blocks, which reduces the complexity of the HTML parser and obtains the main information of a web page as quickly as possible. Block based feature extraction plays a very valuable role in the large scale classifier.

The problem of block priority detection is treated as a function regression procedure using the position and text features of a web page. A Back Propagation Artificial Neural Network is used to train the perceptron. The features of a block are its position (width, height and coordinates) and its inner hyperlinks (Hypertext Density and Block Density). The training dataset is 200 web pages crawled from news.baidu.com, and the prediction dataset is 100 web pages from news.google.com. The accuracy of Main Block detection reaches above 85%.

A hierarchical Support Vector Machine is used to implement the large scale web page classifier.
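As an illustration of the hyperlink-based block features mentioned above, the following is a minimal sketch of one possible link density measure: the share of a block's visible text that sits inside `<a>` elements. The abstract does not give the exact definitions of Hypertext Density and Block Density, so the formula, the class name `LinkDensityParser` and the function `link_density` are assumptions made for this example, not the author's implementation.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts non-whitespace text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # greater than 0 while inside an <a> element
        self.total_chars = 0    # all non-whitespace text characters
        self.anchor_chars = 0   # text characters that sit inside links

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # ignore whitespace
        self.total_chars += n
        if self.anchor_depth > 0:
            self.anchor_chars += n


def link_density(block_html: str) -> float:
    """Assumed feature: anchor text length divided by total text length."""
    parser = LinkDensityParser()
    parser.feed(block_html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.anchor_chars / parser.total_chars


# Hypothetical blocks: a navigation bar and a piece of main content.
nav_block = '<div><a href="/a">Home</a> <a href="/b">News</a></div>'
main_block = '<div><p>Long article text here. <a href="/c">more</a></p></div>'

print(link_density(nav_block))   # 1.0, every character is link text
print(link_density(main_block))  # well below 0.5, mostly plain text
```

Under this assumed definition a navigation bar scores close to 1 and a body paragraph scores close to 0, which is the kind of signal a regression model could combine with the position features to rank blocks. The sketch also counts script and style text as content, which a real parser would need to exclude.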
I use web page text features (Information Gain and TF/IDF) and web page structural features (HTML tags and block features) to build the web page feature vector space. The Support Vector Machine provides a framework and general method for solving machine learning problems with limited samples; it has a rigorous theoretical foundation and addresses small sample sizes, nonlinearity, high dimensionality, local minima and other practical problems. The hierarchical structure greatly improves the efficiency of the classifier and makes it easy to build a distributed computing environment. A hierarchical SVM sets up a single classifier for each inner node of the category search tree. This helps to reduce the dimension of the search space and accelerates the convergence of the classifier.

In the experiments, we compare the accuracy and speed of a flat trigram Bayes classifier and a hierarchical SVM classifier, with and without block feature extraction. The hand-built category directory of the Open Directory Project (ODP) is used as the training and testing dataset. On the top four levels of ODP categories, the hierarchical SVM with block feature selection achieves a Micro-F1 value nearly 5% higher than the traditional hierarchical SVM classifier. The experimental results show that the HSVM achieves better accuracy and considerably better efficiency, and that adding block features to a classifier yields a large gain in accuracy.

For future research, several points are worth optimizing. New blocking algorithms such as VIPS could be introduced in place of the Dom Tree. Improving the quality and reliability of the training dataset could also improve the accuracy of the classifier. Judging the quality of the taxonomy is another important task; unsupervised clustering could be used to define the taxonomy tree.

Web data mining and web page classification are long-term work, while an intelligent expert system that can classify web pages according to their real semantics is a desirable goal.
A variety of disciplines need to work together to finally give the information on the Internet a well-organized structure.
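The hierarchical scheme described in the abstract, with one classifier per inner node of the category tree, can be sketched as a top-down routing procedure. In the sketch below a simple keyword-overlap scorer stands in for each trained SVM, and the category names, keyword sets and helper names (`CategoryNode`, `keyword_classifier`, `classify_top_down`) are hypothetical; none of them come from the thesis.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

# A node classifier maps a token list to the name of one child category.
NodeClassifier = Callable[[List[str]], str]


@dataclass
class CategoryNode:
    name: str
    children: Dict[str, "CategoryNode"] = field(default_factory=dict)
    classifier: Optional[NodeClassifier] = None  # only inner nodes have one


def keyword_classifier(keywords: Dict[str, Set[str]]) -> NodeClassifier:
    """Stand-in for a trained SVM: picks the child with the largest keyword overlap."""
    def classify(tokens: List[str]) -> str:
        token_set = set(tokens)
        return max(keywords, key=lambda child: len(keywords[child] & token_set))
    return classify


def classify_top_down(root: CategoryNode, tokens: List[str]) -> List[str]:
    """Walks from the root to a leaf, asking only the current node's classifier."""
    path = []
    node = root
    while node.children:
        child_name = node.classifier(tokens)
        node = node.children[child_name]
        path.append(node.name)
    return path


# Hypothetical two-level taxonomy.
computers = CategoryNode(
    "Computers",
    {"Programming": CategoryNode("Programming"), "Hardware": CategoryNode("Hardware")},
    keyword_classifier({
        "Programming": {"python", "compiler", "code"},
        "Hardware": {"cpu", "memory", "disk"},
    }),
)
sports = CategoryNode(
    "Sports",
    {"Soccer": CategoryNode("Soccer"), "Tennis": CategoryNode("Tennis")},
    keyword_classifier({
        "Soccer": {"goal", "penalty"},
        "Tennis": {"racket", "serve"},
    }),
)
root = CategoryNode(
    "Top",
    {"Computers": computers, "Sports": sports},
    keyword_classifier({
        "Computers": {"software", "cpu", "compiler"},
        "Sports": {"match", "goal", "player"},
    }),
)

print(classify_top_down(root, ["python", "compiler", "software"]))
# ['Computers', 'Programming']
```

Each decision only compares the children of the current node, so a page is never scored against the whole taxonomy at once. This is the sense in which the hierarchy narrows the search space, and the independent per-node classifiers are also what would let training be spread across machines.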
Keywords/Search Tags: Classification