
Web Page Classification Based On Multiple Features And Combined Multi-classifiers

Posted on: 2020-11-10 | Degree: Master | Type: Thesis
Country: China | Candidate: L Deng | Full Text: PDF
GTID: 2428330578973922 | Subject: Engineering
Abstract/Summary:
The amount of information on the internet has exploded over time, giving people access to valuable resources. Web page classification is critical for website management and information retrieval, for example in developing and maintaining web directories, improving the efficiency of search engines, and filtering web pages. However, precise web page classification is challenging because web page data is semi-structured, content and structure vary widely among web pages, and pages contain noise such as advertisements and copyright notices. This thesis therefore studies web page classification methods that achieve high performance.

First, we propose a web page classification method based on the fusion of textual and structural features. The HyperText Markup Language (HTML) tags in the HTML documents of web pages are extracted and converted into vectors that characterize the structural features of the pages. Key texts in the title, meta tags, hyperlinks, etc. are then extracted and converted into vectors that capture the textual features. The heterogeneous textual and structural features are fused by vector concatenation for classification. The fusion of textual features with the proposed structural features describes a page more comprehensively and yields higher accuracy than either kind of feature alone.

Second, we combine multiple classifiers based on confidence and implement a web page classification system based on feature fusion and classifier combination. Different classifiers have different characteristics, and combining them exploits their complementarity. We select a set of samples and use the classification accuracy on these samples as the confidence of the classification result. The classifiers are then combined with decision strategies such as voting and confidence comparison to give a better classification result. The web page classification system consists of a data acquisition and processing module, a feature extraction and vectorization module, a sub-classifier module, and a combined multi-classifier module. Experimental results show that the accuracy increases to 94.2%, 95.4%, and 95.7% on the Amazon dataset, the 7-web-genres dataset, and the DMOZ dataset, respectively. The proposed method achieves higher accuracy than related web page classification algorithms.

Third, we propose a classification method specifically for mobile web pages. Because mobile devices have small, vertically oriented screens, mobile web pages tend to have a simple list structure. The content of a page appears as an information flow, with important information appearing first. To exploit these characteristics, we use an algorithm that locates the information flow. We extract the subject information, header information, and information flow content for classification. In experiments on mobile web pages that we collected, the accuracy of the proposed method reaches 97.2%.
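
The feature fusion and confidence-based classifier combination described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`fuse_features`, `estimate_confidence`, `combine`) are hypothetical, and the specific decision rule (majority vote, with ties broken by summed confidence) is an assumption, since the abstract names voting and confidence comparison only in general terms.

```python
from collections import Counter, defaultdict


def fuse_features(text_vec, struct_vec):
    """Fuse textual and structural feature vectors by concatenation."""
    return list(text_vec) + list(struct_vec)


def estimate_confidence(predictions, labels):
    """Use accuracy on a selected sample set as a classifier's confidence."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def combine(predictions, confidences):
    """Combine sub-classifier outputs (assumed rule, not from the thesis).

    predictions: {classifier_name: predicted_label}
    confidences: {classifier_name: confidence in [0, 1]}
    Majority vote decides; a tie is broken by comparing the summed
    confidence of the classifiers backing each tied label.
    """
    votes = Counter(predictions.values())
    top = max(votes.values())
    tied = [label for label, n in votes.items() if n == top]
    if len(tied) == 1:
        return tied[0]
    score = defaultdict(float)
    for name, label in predictions.items():
        if label in tied:
            score[label] += confidences[name]
    return max(tied, key=lambda label: score[label])
```

In this sketch, each sub-classifier would be trained on the fused vector, and `estimate_confidence` would be run once per sub-classifier on the selected samples before `combine` is called on new pages.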
Keywords/Search Tags: Web Page Classification, Web Page Features, Combined Multi-classifiers, Web Page Structure, Mobile Web Pages