
Design And Implementation Of Vertical Search Engine Based On Improved HITS Algorithm

Posted on: 2022-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: T L Qiao
Full Text: PDF
GTID: 2518306575460984
Subject: Software engineering
Abstract/Summary:
With the spread of Internet applications, the volume of data on the network has reached an unprecedented scale, and the Internet has entered the era of big data. Retrieving the desired information from this mass of data is difficult, and search engines give users a fast way to search the Internet. Traditional search engines cover the whole Internet, however, so their search speed and accuracy are not ideal. Vertical search engines emerged in response: a vertical search engine filters pages before acquiring data, which narrows the search scope and keeps query results highly relevant to the topic. This thesis first reviews the development of search engines and the state of research in China and abroad, and then introduces the basic structure, workflow, and background knowledge of search engines. Next, a topic crawler algorithm based on improved HITS is proposed. The algorithm adds the SVSM model to traditional HITS so that pages are evaluated on both web content and link structure. Experiments show that, compared with a traditional web crawler, the proposed crawler algorithm improves both the accuracy of page relevance judgments and the efficiency of crawling. To deduplicate the crawled content accurately, this thesis proposes an adaptive deduplication mechanism. It builds on the Simhash deduplication algorithm and adds a Bloom filter, which deduplicates short text well, so that both short and long texts are handled. The experimental results show that the mechanism deduplicates mixed sets of long and short texts effectively. This thesis then designs a new vertical search engine. Users enter a query on the front-end page, and the page returns the query results. In the data acquisition module, the crawler uses the improved HITS algorithm to collect data. After the data is cleaned, the adaptive deduplication mechanism removes duplicates, and the remaining data is stored in an Elasticsearch index, which performs word segmentation and builds the inverted index. Finally, the functions and performance of the vertical search engine are tested. The tests show that the vertical search engine implements all the required functions and that its response time, data security, and stability meet the requirements. In the comparison, the vertical search engine also performs better than two commonly used general search engines.
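The adaptive deduplication idea described in the abstract (a Bloom filter for short text, Simhash for long text) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the class names, the 50-character length threshold, the Hamming-distance limit of 3, the whitespace tokenization, and the unweighted 64-bit fingerprint are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, text):
        # Derive several bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{text}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, text):
        for pos in self._positions(text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, text):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(text))


def simhash(text):
    """Unweighted 64-bit Simhash over whitespace-separated tokens (assumed tokenizer)."""
    counts = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class AdaptiveDeduplicator:
    """Routes short text to the Bloom filter and long text to Simhash."""

    def __init__(self, short_limit=50, max_distance=3):
        self.short_limit = short_limit    # assumed length threshold (characters)
        self.max_distance = max_distance  # assumed near-duplicate threshold (bits)
        self.bloom = BloomFilter()
        self.fingerprints = []

    def is_duplicate(self, text):
        """Return True if text looks already seen; otherwise record it and return False."""
        normalized = " ".join(text.split())
        if len(normalized) < self.short_limit:
            if normalized in self.bloom:
                return True
            self.bloom.add(normalized)
            return False
        fingerprint = simhash(normalized)
        if any(hamming(fingerprint, seen) <= self.max_distance
               for seen in self.fingerprints):
            return True
        self.fingerprints.append(fingerprint)
        return False
```

In this sketch the linear scan over stored fingerprints keeps the code short; a production system would more likely index fingerprints by bit blocks, and Chinese text would need a word segmenter in place of `split()`.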
Keywords/Search Tags: Vertical search engine, Topic crawler, HITS, Elasticsearch, Simhash algorithm