Web Page Search Ranking Algorithm Based On Text Categorization

Posted on:2019-08-11

Degree:Master

Type:Thesis

Country:China

Candidate:M Y Liu

Full Text:PDF

GTID:2428330548985933

Subject:Signal and Information Processing

Abstract/Summary:

According to iResearch's iUserTracker monitoring data, search engines ranked first among PC website categories in January 2017 by monthly coverage, at 98.4%. Although the Internet is growing explosively and along many dimensions, search engines remain the largest traffic portal and deserve continued attention. However, many search engines suffer from topic drift: a returned web page whose content has no connection with the domain of the query keywords, which seriously degrades the user experience. Text accounts for the largest proportion of information on the Internet, and most users search for knowledge by keywords. On this basis, we study the textual information of web pages in depth using text-related techniques. To address topic drift, and to remove the need for manually established domain vectors in existing improved algorithms, we propose a web page search ranking algorithm based on text categorization.

The main contributions of the thesis are as follows:

(1) We study a text categorization method based on stacked autoencoders. Dimension reduction by the stacked autoencoders overcomes the curse of dimensionality that traditional machine learning methods face on text problems. Experimental results show that the method reduces the dimension of the original data, extracts higher-order features, and achieves higher classification accuracy.

(2) We propose a web page search ranking algorithm based on text categorization. The algorithm first preprocesses the text of the web pages and represents it with the bag-of-words model. It then trains a softmax regression classifier on a small amount of web data and uses it to predict category scores for the test web pages. The category scores are combined with BM25 information retrieval scores to produce the final web page ranking. Experimental results show that the algorithm achieves relatively good ranking performance without manually set-up domain vectors.
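The ranking step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how the category scores and BM25 scores are combined, so the weighted sum below, the weight `alpha`, the normalization of BM25 by its maximum, the BM25 parameters `k1` and `b`, and the toy vocabulary, weights, and documents are all assumptions made for illustration, not the thesis's actual method. The softmax regression weights are hand-set here rather than trained.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def category_probs(doc, vocab, weights, biases):
    """Softmax regression applied to a bag-of-words count vector."""
    tf = Counter(doc)
    x = [tf[t] for t in vocab]
    logits = [sum(w * xi for w, xi in zip(row, x)) + bias
              for row, bias in zip(weights, biases)]
    return softmax(logits)


def rank(query, query_category, docs, vocab, weights, biases, alpha=0.5):
    """Return document indices ordered by the combined score (best first).

    Assumption: combined = alpha * normalized BM25 + (1 - alpha) * P(category).
    """
    bm25 = bm25_scores(query, docs)
    top = max(bm25) or 1.0  # avoid dividing by zero when nothing matches
    combined = []
    for i, d in enumerate(docs):
        cat = category_probs(d, vocab, weights, biases)[query_category]
        combined.append((alpha * bm25[i] / top + (1 - alpha) * cat, i))
    return [i for _, i in sorted(combined, reverse=True)]


# Hypothetical toy data: category 0 = food, category 1 = technology.
vocab = ["fruit", "recipe", "phone", "iphone"]
weights = [[1.0, 1.0, -1.0, -1.0],   # food
           [-1.0, -1.0, 1.0, 1.0]]   # technology
biases = [0.0, 0.0]
docs = [
    ["apple", "pie", "recipe", "fruit"],
    ["apple", "iphone", "phone", "release"],
    ["banana", "fruit", "recipe"],
]

print(rank(["apple"], 1, docs, vocab, weights, biases))
```

In this toy example the two pages containing "apple" receive the same BM25 score, so keyword matching alone cannot separate them. The category score breaks the tie in favor of the page whose predicted category matches the query's domain, which is the topic-drift situation the abstract describes.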

Keywords/Search Tags:

domain, text categorization, softmax regression classification, web page ranking, stacked autoencoders

Related items

1	Text Categorization Algorithm Based On Machine Learning
2	Research And Implementation On Key Technology Of Web Text Collection And Analysis
3	News Page Re-ranking Algorithm For Specific Domains
4	Based On Bi-GRU And L-Softmax Text Classification Model
5	The Research Of Multi-source Remote Sensing Images Change Detection Based On Stacked Denoising Autoencoders
6	The Research On Cross Domain Text Classification Based On Autoencoders
7	Image Classification Method Based On Abandoned Stacked Restricted Boltzmann Machine
8	The Research On Text Categorization Technology Based On Partial Least Square
9	Research On Text Classification Based On Hybrid Model Of Deep Learning
10	Face Recognition Based On LBP And Stacked Autoencoders