
Web Page Search Ranking Algorithm Based On Text Categorization

Posted on: 2019-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Liu
GTID: 2428330548985933
Subject: Signal and Information Processing
Abstract/Summary:
According to iResearch's iUserTracker monitoring data, search engines ranked first in monthly coverage among PC website categories in January 2017, at 98.4%. Although the Internet is growing explosively and along many dimensions, search engines remain its largest traffic portal and deserve more attention. However, many search engines suffer from the topic-drift problem: the content of a returned web page has no connection with the domain of the query keywords, which seriously degrades the user experience. Text makes up the largest proportion of the information on the Internet, and most users search for knowledge by keywords. We therefore study the textual information of web pages in depth using text-processing techniques. To address topic drift, and the fact that related improved algorithms require domain vectors to be built manually, a web page search ranking algorithm based on text categorization is proposed. The main work of the thesis is as follows:
(1) We study a text categorization method based on stacked autoencoders. Dimension reduction by the stacked autoencoders overcomes the curse of dimensionality that arises when traditional machine learning methods are applied to text. Experimental results show that the method reduces the dimension of the original data, extracts higher-order features, and obtains higher classification accuracy.
(2) We propose a web page search ranking algorithm based on text categorization. The algorithm first preprocesses the text of the web pages and represents it with the bag-of-words model. It then trains a softmax regression classifier on a small amount of web data and uses it to predict the category scores of the test web pages. The category scores are combined with BM25 information retrieval scores to produce the final web page ranking. Experimental results show that the algorithm achieves relatively good ranking performance without manually setting up domain vectors.
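The ranking step in (2) can be illustrated with a minimal sketch. The abstract does not say how the category score and the BM25 score are combined, so the linear interpolation with weight `alpha`, the max-normalisation of BM25, the assumption that the query's target category is known, the toy data, and all function names below are illustrative assumptions, not the thesis's actual method.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def bow(tokens, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def predict(W, x):
    """Class probabilities of a softmax regression model without bias."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])

def train_softmax(X, y, n_classes, lr=0.1, epochs=200):
    """Fit softmax regression by plain stochastic gradient descent."""
    W = [[0.0] * len(X[0]) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in zip(X, y):
            p = predict(W, x)
            for c in range(n_classes):
                g = p[c] - (1.0 if c == label else 0.0)
                for j, v in enumerate(x):
                    W[c][j] -= lr * g * v
    return W

def rank(query, query_class, docs, W, vocab, alpha=0.5):
    """Assumed combination: alpha * normalised BM25 + (1 - alpha) * category score."""
    bm = bm25_scores(query, docs)
    top = max(bm) or 1.0  # avoid division by zero when no document matches
    combined = [
        alpha * bm[i] / top + (1 - alpha) * predict(W, bow(d, vocab))[query_class]
        for i, d in enumerate(docs)
    ]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)

# Toy training set (hypothetical): class 0 = computing, class 1 = food.
train = [
    ("apple iphone laptop software chip", 0),
    ("laptop software chip computer", 0),
    ("fruit pie recipe sugar apple", 1),
    ("recipe sugar bake fruit", 1),
]
vocab = sorted({t for text, _ in train for t in text.split()})
W = train_softmax([bow(t.split(), vocab) for t, _ in train],
                  [label for _, label in train], n_classes=2)

# Toy pages to rank for the query "apple" issued in the computing domain.
pages = [
    "apple pie recipe".split(),
    "apple laptop software update chip new release".split(),
    "bake fruit recipe sugar".split(),
]
bm = bm25_scores(["apple"], pages)
bm25_only = sorted(range(len(pages)), key=lambda i: bm[i], reverse=True)
ranking = rank(["apple"], 0, pages, W, vocab)
print("BM25 only:", bm25_only)
print("BM25 + category score:", ranking)
```

On this toy data BM25 alone favours the short recipe page, since it matches the query term in fewer words, while the combined score moves the computing page to the top. That is the kind of topic drift the category score is meant to correct. A smaller `alpha` gives the classifier more influence over the final order.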
Keywords/Search Tags: domain, text categorization, softmax regression classification, web page ranking, stacked autoencoders