Research Of Webpage Classification Model Based On URL And Content

Posted on:2019-01-12

Degree:Master

Type:Thesis

Country:China

Candidate:Y K Cheng

GTID:2348330542955558

Subject:Signal and Information Processing

Abstract/Summary:

With the rapid development of the Internet, the number of web pages has grown explosively. As a carrier of information, the Internet continuously produces a huge volume of text on many different themes. How to obtain the necessary information from these vast and dynamic resources has become a key issue in Internet information applications. At present, classification techniques from data mining are usually used to organize and archive web pages, so as to improve user-oriented web services. Research on web page classification builds on text classification, which requires analyzing page content. If each single web page is treated as the processing object, the computational cost is high, which makes the approach unsuitable for large-scale stream data. From the perspective of user behavior, certain groups of users show pronounced preferences in their HTTP access patterns, so a single fixed classification process cannot reflect users' behavioral characteristics and is inefficient. To address these problems of traditional content-based web page classification, this thesis proposes a web page topic classification method based on URL + text, and designs and implements two classification models to meet the demands of different scenarios.

(1) A web page classification model based on word embedding. This model targets pages belonging to eight predefined themes, such as military, finance, entertainment, and sports. Unlike the text vector representations used by traditional classification algorithms, the word embedding model captures semantic similarity between words. In addition, web page text extraction is improved by taking the structural characteristics of web pages into account, and is further optimized with a density-based clustering algorithm.

(2) A web page classification model based on URL + keywords. This model targets thematic pages. Because the URL itself carries valuable information, including highly specific keywords that recur in the page text, a URL segmentation algorithm is proposed. The TextRank keyword extraction algorithm is also improved, and a naive Bayes model is then used to classify unknown web pages.

(3) Experiments verify the feasibility and validity of the classification models in different scenarios and report their classification performance on Internet web pages.
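The URL + keywords pipeline in (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the proposed URL segmentation algorithm or the improved TextRank, so the sketch substitutes plain delimiter-based URL splitting and assumes page keywords are supplied by some upstream extractor. The stop-token list, the `example.com` URLs, the keywords, and all function and class names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical list of URL tokens that carry no topical information.
STOP_TOKENS = {"www", "com", "cn", "net", "org", "html", "htm", "index"}


def segment_url(url):
    """Split host and path into lowercase alphabetic tokens (simple stand-in
    for the thesis's URL segmentation algorithm, which is not specified)."""
    parsed = urlparse(url)
    raw = re.split(r"[^a-z]+", (parsed.netloc + parsed.path).lower())
    return [t for t in raw if len(t) > 1 and t not in STOP_TOKENS]


class NaiveBayes:
    """Multinomial naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()                # label -> number of pages
        self.token_counts = defaultdict(Counter)   # label -> token frequencies
        self.vocab = set()

    def fit(self, samples):
        for tokens, label in samples:
            self.class_docs[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_demo_model():
    """Train on a tiny made-up set of (URL, page keywords, theme) triples."""
    data = [
        ("http://sports.example.com/nba/game-recap.html", ["basketball", "score"], "sports"),
        ("http://www.example.com/sports/football/league.html", ["football", "match"], "sports"),
        ("http://finance.example.com/stock/market-news.html", ["stock", "shares"], "finance"),
        ("http://www.example.com/finance/bank/rates.html", ["bank", "interest"], "finance"),
    ]
    samples = [(segment_url(url) + keywords, label) for url, keywords, label in data]
    return NaiveBayes().fit(samples)
```

Calling `train_demo_model().predict(segment_url(url) + keywords)` on an unseen page returns the theme with the highest posterior score. URL tokens and page keywords are simply concatenated into one bag of tokens here; weighting the two sources differently would be a natural variation, but the abstract does not say how the thesis combines them.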

Keywords/Search Tags:

Website classification, vector representation, word embedding, Naïve Bayes

Related items

1	Dynamic Weighting Of Word Embedding And Distributed Learning Strategies
2	Research On The Representation Of Word Embedding Based On Knowledge Fusion
3	Design And Implementation Of The Server Side Of A Customer Service Robot System
4	Research On Chinese Text Classification Based On Deep Learning
5	Research On Image Classification Algorithm Based On Local Feature And Feature Representation
6	Research On The Application Of Chinese-Burmese Bilingual Sentence-level Embedding Semantic Representation Method Based On Neural Network
7	Research On Chinese Short Text Classification Based On Word Embedding
8	Representation Learning Based Word Embedding Extraction And Its Application On Sentiment Analysis
9	Text Representation And Classification Based On Deep Learning
10	Research And Application On Word Embedding Of Low Frequency Words