News Page Re-ranking Algorithm For Specific Domains

Posted on: 2016-01-17 | Degree: Master | Type: Thesis
Country: China | Candidate: C Pan | Full Text: PDF
GTID: 2308330473457045 | Subject: Computer software and theory
Abstract/Summary:
With the number of web pages growing and information booming, finding information that is valuable to users among vast amounts of data has become a critical problem for the Internet. Search engine technology emerged in response and has become one of the main ways users obtain information online. However, general-purpose search engines often suffer from the topic-drift problem: some highly ranked retrieval results are irrelevant to the query, which degrades the user experience.

Our extensive case studies of the topic-drift problem indicate that news pages belonging to the same domain often contain similar keywords. Motivated by this observation, this dissertation explores a news page re-ranking algorithm for specific domains. The main contributions of this dissertation are as follows:

(1) Introduce the background and main technologies of search engines, including web crawlers, web page classification, and web page ranking.

(2) Study methods for building a vector model for a specific domain, and construct a classifier for news pages in specific domains. Experimental results show that the classifier achieves excellent classification precision.

(3) Propose TSRR, a news page re-ranking algorithm for specific domains. TSRR establishes a vector model for a specific domain that is independent of page rank, together with a web page information model; it then combines the two models to re-rank news page search results during retrieval. TSRR’s performance is evaluated on customer satisfaction and precision. Experimental results on a dataset crawled for specific domains show that TSRR performs well: compared with the ranking algorithm from Lucene, it improves customer satisfaction by 17.3% and precision by 41.9% on average.

(4) Design and implement a news page aggregation system for specific domains that combines all the methods proposed in this dissertation. The dissertation also describes the system's implementation and user interface.
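
The abstract does not give TSRR's formulas, so the following is only a minimal Python sketch of the general idea in contribution (3): score each result against a domain keyword vector and blend that with the engine's own score. Everything specific here is an assumption made for illustration and is not taken from the dissertation: the whitespace tokenizer, the use of cosine similarity, the linear blend, the weight `alpha`, and the function names `term_vector`, `build_domain_vector`, and `rerank`. The page "information model" is reduced to a `(text, engine_score)` pair.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector of a text (whitespace tokenization is a simplification)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_domain_vector(domain_docs):
    """Hypothetical domain model: summed term frequencies over sample domain pages."""
    total = Counter()
    for doc in domain_docs:
        total.update(term_vector(doc))
    return total


def rerank(results, domain_vector, alpha=0.5):
    """Re-rank (page_text, engine_score) pairs, with engine_score assumed in [0, 1].

    The combined score is an assumed linear blend:
    alpha * domain similarity + (1 - alpha) * engine score.
    """
    scored = []
    for text, engine_score in results:
        domain_score = cosine(term_vector(text), domain_vector)
        combined = alpha * domain_score + (1 - alpha) * engine_score
        scored.append((combined, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


# Usage: a finance-like domain pulls the on-topic page above an off-topic one
# that the engine originally scored higher.
domain = build_domain_vector([
    "stock market shares rise",
    "bank interest rate market",
])
results = [
    ("celebrity gossip party", 0.9),
    ("stock market bank shares", 0.6),
]
reranked = rerank(results, domain)
```

In this toy setup the off-topic page has zero similarity to the domain vector, so its blended score falls below that of the on-topic page even though its engine score was higher. That is the kind of topic-drift correction the abstract describes, though the actual TSRR scoring may differ substantially.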
Keywords/Search Tags: re-ranking, web page classification, domain model, web information model, search engine