Research On Multi-source People Web Pages Classification Based On Ensemble Learning

Posted on: 2022-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: X L Zhang
GTID: 2518306524975849
Subject: Information and Communication Engineering
Abstract/Summary:
Extracting and analyzing the attributes and behavior of people from the Internet, in support of person profile generation, relationship analysis and behavior prediction, has become a research hotspot in the field of network information mining. Accurately classifying people web pages in advance can reduce workload, reduce noise and improve analysis efficiency. Existing web page classification methods mainly address the topical domain of a page, such as "art", "business" or "sports", while research on classifying people web pages is comparatively scarce. This thesis addresses the people web page classification problem. Building on an analysis of existing web page classification methods and on the structural and content characteristics of people web pages, it proposes a classification method based on ensemble learning that combines multi-source features of people web pages, including text features and structural features, to classify such pages accurately. The specific research contents are as follows:

(1) A feature extraction method suited to people web page classification is proposed. Based on an in-depth analysis of people web pages, features are analyzed and extracted at three levels: statistical features, text features and visual features. For the statistical features, the page is processed and analyzed together with its URL and source code, and statistics are gathered on the characters, keywords, tense, page structure and other information it contains. For the text features, the body of a people web page contains extensive descriptions of a person's life and many passages related to personal attributes, so TF-IDF and Word2vec are combined to extract features from the body text. For the visual features, web page vision is used to describe the structure and design style of the page: a rendered screenshot of the page is captured, the valid images in the page are obtained from its source code, and both are vectorized to form the visual features. Experiments show that all three feature types are effective for classifying people web pages.

(2) A classifier construction method based on ensemble learning is proposed. Because different features have different characteristics and different classification algorithms have different strengths, separate classifiers are used for the statistical features, the visual features and the body text features, and an attention mechanism is added to the recurrent convolutional neural network. Because the features carry different information, the classifiers built on each feature type are handled separately to produce the final classification result. The experimental results show that the multi-source web page classification method based on ensemble learning achieves high classification accuracy.
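The abstract does not say how the outputs of the per-feature classifiers are merged. As an illustration only, the sketch below assumes weighted soft voting over class probabilities, a common ensemble scheme that is not necessarily the one used in the thesis. The function names, the classifier names ("statistical", "text", "visual"), the labels, the weights and the probability values are all hypothetical.

```python
from typing import Dict, List, Sequence


def soft_vote(probabilities: Dict[str, Sequence[float]],
              weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-classifier class-probability vectors.

    Assumption: each classifier (one per feature type) outputs a
    probability for every class, in the same class order.
    """
    if not probabilities:
        raise ValueError("need at least one classifier output")
    n_classes = len(next(iter(probabilities.values())))
    total = sum(weights[name] for name in probabilities)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = [0.0] * n_classes
    for name, probs in probabilities.items():
        if len(probs) != n_classes:
            raise ValueError(f"classifier {name!r} has a different class count")
        share = weights[name] / total  # normalized weight of this classifier
        for i, p in enumerate(probs):
            combined[i] += share * p
    return combined


def predict(probabilities: Dict[str, Sequence[float]],
            weights: Dict[str, float],
            labels: Sequence[str]) -> str:
    """Return the label with the highest combined probability."""
    combined = soft_vote(probabilities, weights)
    best = max(range(len(combined)), key=combined.__getitem__)
    return labels[best]


# Hypothetical outputs of three classifiers for a single web page.
labels = ["people page", "other page"]
outputs = {
    "statistical": [0.6, 0.4],
    "text": [0.9, 0.1],
    "visual": [0.3, 0.7],
}
# Hypothetical weights; here the text classifier is trusted most.
weights = {"statistical": 1.0, "text": 2.0, "visual": 1.0}

print(predict(outputs, weights, labels))
```

In this sketch the visual classifier disagrees with the other two, but the weighted average still favors the first label. In practice the weights would have to be chosen on validation data.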
Keywords/Search Tags: web page classification, neural networks, machine learning, ensemble learning