Research Of Webpage Classification Model Based On URL And Content

Posted on: 2019-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Cheng
GTID: 2348330542955558
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of the Internet, the number of web pages has grown explosively. As a carrier of information, the Internet continuously produces large volumes of text on many different themes. How to obtain the necessary information from these vast and dynamic resources has become a key issue in Internet information applications. At present, classification techniques from data mining are commonly used to organize and archive web pages, so as to improve user-facing web services.

Research on web page classification builds on text classification, which requires analyzing the content of each page. If every individual page is processed on its own, the computational cost is high, which makes the approach unsuitable for large-scale stream data. From the perspective of user behavior, certain groups of users show distinct preferences in their HTTP access, so a single fixed classification process cannot reflect users' behavioral characteristics and is inefficient. To address these problems of traditional content-based web page classification, this thesis proposes a web page subject classification method based on URL + text, and designs and implements two classification models for different scenarios.

(1) A web page classification model based on word embedding. This model targets pages that fall under eight predefined themes, including military, finance, entertainment, and sports. Unlike the text vector representation used by traditional classification algorithms, the word embedding model captures the semantic similarity between words. In addition, web page text extraction is improved by taking the structure of the page into account, and is further optimized with a density clustering algorithm.

(2) A web page classification model based on URL + keywords. This model targets thematic pages. Because the URL itself carries valuable information, with highly specific keywords and repeated text, a URL segmentation algorithm is proposed. The TextRank keyword extraction algorithm is also improved. Finally, a naive Bayes model is used to classify unknown web pages.

(3) Experiments verify the feasibility and validity of the classification models in the different scenarios, and report the classification performance of the models on Internet web pages.
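The URL + keywords pipeline in (2) can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract does not specify the proposed URL segmentation algorithm or the improved TextRank, so the sketch assumes a simple split on non-letter characters, takes page keywords as given, and uses a standard multinomial naive Bayes with Laplace smoothing. The names `url_tokens` and `NaiveBayes`, the stop list, and the `example.com` training data are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Assumed stop list of uninformative URL parts (not taken from the thesis).
URL_STOP = {"www", "com", "cn", "net", "org", "html", "htm"}


def url_tokens(url):
    """Split a URL's host and path into lowercase alphabetic tokens."""
    parsed = urlparse(url)
    raw = re.split(r"[^a-z]+", (parsed.netloc + parsed.path).lower())
    return [t for t in raw if len(t) > 1 and t not in URL_STOP]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data: (URL, page keywords, theme).
train = [
    ("http://sports.example.com/nba/game-report.html",
     ["nba", "basketball", "score"], "sports"),
    ("http://www.example.com/sports/football/match.html",
     ["football", "league", "goal"], "sports"),
    ("http://finance.example.com/stock/market-news.html",
     ["stock", "market", "bank"], "finance"),
    ("http://www.example.com/finance/fund/rates.html",
     ["fund", "interest", "bank"], "finance"),
]

# Each page is represented by its URL tokens plus its keywords.
model = NaiveBayes().fit(
    [url_tokens(url) + keywords for url, keywords, _ in train],
    [label for _, _, label in train],
)

features = url_tokens("http://www.example.com/sports/nba/news.html") + ["score"]
print(model.predict(features))
```

In this sketch the URL tokens and the page keywords are simply concatenated into one bag of tokens, which is the simplest way to combine the two sources; a real system might weight them differently.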
Keywords/Search Tags: Website classification, vector representation, word embedding, Naive Bayes