A Vertical Search Engine In The Field Of News

Posted on: 2019-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: H L Xu
GTID: 2428330545970140
Subject: Electronic and communication engineering
Abstract/Summary:
A vertical search engine is a domain-specific search engine. Compared with a traditional general-purpose search engine, a vertical search engine can better meet the search needs of specific groups and segment the user base. When a user enters a search keyword, the vertical search engine returns relevant information for that particular field. This paper proposes and designs a small vertical search engine for the news field.

To this end, a topic-focused news web crawler based on Heritrix is designed. Topic-specific customization of the crawler's crawl rules and page iteration lets Heritrix crawl only news pages and filter out redundant, unwanted pages. Heritrix's crawl queue is also improved to address its inability to use multiple threads when crawling pages under the same domain name. The BKDRHash algorithm is introduced to compute a separate hash value for each news page URL to be crawled, and the URLs are then distributed evenly across the crawler thread queues according to their hash values. Comparative experiments show that this greatly improves crawling speed.

This paper also designs a text classification algorithm to address the confusion among news text categories. It is an imbalanced text classification algorithm based on support vector machines (SVM). To deal with the unbalanced text dataset, the algorithm uses SMOTE to generate interpolated samples that balance the dataset, then iteratively evolves them with particle swarm optimization (PSO) to obtain the best interpolated samples, thereby optimizing the SVM's text classification ability. Experimental results show that the new algorithm greatly improves the ability of the SVM to classify unbalanced text datasets.

In addition, this paper designs a PageRank-based web page ranking algorithm that uses the topic relevance and update frequency of web pages. Building on the PageRank algorithm, it takes the topic relevance of each page into account and introduces a page update frequency factor to adjust the ranking priority of new pages. Experiments show that the algorithm effectively improves the search engine's query accuracy.

Finally, this paper combines the two algorithms above and builds a news vertical search engine system on the Lucene retrieval framework. After Lucene indexes the news data, the user can search for news directly in the search interface. The system lets the user select a news category before running a query, which improves the verticality and degree of subdivision of news retrieval.
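The URL distribution step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's Heritrix modification, which is not shown here; it only assumes the commonly used form of BKDRHash (multiply by a seed such as 131, add each character code, keep 31 bits) and assigns each URL to a queue by taking the hash modulo the number of queues. The function names, the queue count, and the example URLs are all hypothetical.

```python
from typing import List


def bkdr_hash(text: str, seed: int = 131) -> int:
    """BKDR string hash, truncated to a non-negative 31-bit value.

    The seed 131 is one of the conventional choices (31, 131, 1313, ...);
    which seed the thesis uses is not stated in the abstract.
    """
    h = 0
    for ch in text:
        h = (h * seed + ord(ch)) & 0x7FFFFFFF
    return h


def assign_queue(url: str, num_queues: int) -> int:
    """Map a URL to a queue index in [0, num_queues)."""
    return bkdr_hash(url) % num_queues


def distribute(urls: List[str], num_queues: int) -> List[List[str]]:
    """Spread URLs over per-thread queues according to their hash value."""
    queues: List[List[str]] = [[] for _ in range(num_queues)]
    for url in urls:
        queues[assign_queue(url, num_queues)].append(url)
    return queues


if __name__ == "__main__":
    # Hypothetical URLs under a single domain, the case the abstract targets.
    sample = [f"http://news.example.com/article/{i}.html" for i in range(1000)]
    for index, queue in enumerate(distribute(sample, 4)):
        print(index, len(queue))
```

Because the queue index depends on the whole URL rather than on the host name, pages under the same domain can land in different queues, which is what allows several threads to work on one site at once. How evenly real URLs spread depends on the URL set and the number of queues.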
Keywords/Search Tags: Vertical search engine, Topic web crawler, Text classification, Web page ranking, Lucene retrieval framework