The Research On Key Technologies Of Search Engine Under Cloud Environment

Posted on: 2017-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: J K Yu
GTID: 2348330488997068
Subject: Information networks
Abstract/Summary:
With the explosive growth of the Internet, search engines have become the main entry point for people browsing the web. Faced with the massive volume of data on the Internet, however, traditional search engines cannot keep up with the current environment in crawling speed or data storage. This paper focuses on three key technologies of a search engine system in a cloud environment. The research work is as follows.

First, to reduce the number of duplicate pages fetched from the Internet, this paper introduces a location-related text duplicate detection algorithm called SWLR (Shingling with Location Relation). SWLR is derived from the Shingling algorithm and the LCS algorithm. The experimental results show that SWLR outperforms Shingling in recall and precision. To speed up detection, this paper also introduces a bit-based filter that screens out dissimilar texts. The final experimental results show that the fast SWLR is faster than Shingling and achieves nearly the same recall and precision as SWLR.

Second, to accelerate queries that use multiple keywords in the index system, this paper proposes a linked-list full-text index model based on the inverted index model. By adding to each Term node a link pointer to the neighboring word, the linked-list full-text index can compare neighboring words in O(1) time. The experimental results show that the linked-list full-text index performs well in index construction, search cost, and memory consumption.

Finally, this paper designs a new crawl model based on the Hadoop environment. By running the analysis system and the fetch system in parallel, the crawl model makes full use of I/O and CPU resources. The experimental results show that the proposed crawl model is easy to extend and supports load balancing.
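The abstract does not spell out how SWLR combines Shingling with LCS, so the following is only a minimal sketch of the two ingredients it names, not the thesis's algorithm. It assumes word-level shingles of width `w`, Jaccard resemblance over shingle sets as the plain Shingling baseline, and a longest-common-subsequence ratio over the ordered shingle lists as one possible position-aware signal. The function names (`shingles`, `resemblance`, `lcs_length`, `ordered_overlap`) and the choice of "subsequence" for LCS are assumptions made for illustration.

```python
def shingles(text, w=3):
    """Return the ordered list of word-level w-shingles of a text."""
    words = text.lower().split()
    return [tuple(words[i:i + w]) for i in range(len(words) - w + 1)]


def resemblance(a, b, w=3):
    """Plain Shingling baseline: Jaccard similarity of the shingle sets.

    Shingle positions are ignored, so reordered blocks still match.
    """
    sa, sb = set(shingles(a, w)), set(shingles(b, w))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def lcs_length(xs, ys):
    """Length of the longest common subsequence of two sequences (DP)."""
    prev = [0] * (len(ys) + 1)
    for x in xs:
        cur = [0]
        for j, y in enumerate(ys, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def ordered_overlap(a, b, w=3):
    """Assumed position-aware score: LCS of shingle lists over the longer list.

    Illustrative only; the thesis's actual SWLR scoring is not given
    in the abstract.
    """
    xa, xb = shingles(a, w), shingles(b, w)
    if not xa or not xb:
        return 0.0
    return lcs_length(xa, xb) / max(len(xa), len(xb))


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    reordered = "over the lazy dog the quick brown fox jumps"
    print(resemblance(original, reordered))
    print(ordered_overlap(original, reordered))
```

In this sketch, a text whose blocks have been swapped keeps most of its set-based resemblance but scores lower on the ordered measure, which is the kind of location sensitivity the name SWLR suggests.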
Keywords/Search Tags: search engine, web crawler, full text indexing, text duplicate detection