
The Design And Implementation Of Intelligent Information Retrieval System

Posted on: 2014-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: X H Wang
Full Text: PDF
GTID: 2268330422957490
Subject: Electronic and communication engineering
Abstract/Summary:
Search engines are indispensable tools that gather information and provide search services for Internet users, helping them obtain information from the Internet quickly. However, under the impact of "big data", which has increased both the volume of information and the diversity of information resources, search engines face new challenges in speed and topic relevance. As a result, the next generation of search engines is currently a hot research issue.

When a search engine's crawler collects information from Internet resources, its capacity for handling the queue of URL strings is insufficient. This thesis designs and implements a hash algorithm to process this data quickly. The algorithm establishes a one-to-one relationship between keys and the stored values, so that the string data in the spider's queue is quickly converted into a linear table structure. This improves both the crawler's ability to handle the string queue and its overall performance. A search engine network environment was then built with Heritrix as the web crawler framework, and the hash algorithm was added to the crawler for testing. The experimental results show that, after the hash algorithm is added to the web crawler, search efficiency and fetching speed improve noticeably.

To address the search engine's poor feedback on topic relevance, the approach taken is to improve the topic relevance of the pages that the web crawler captures. A genetic algorithm is introduced into the crawler so that it fetches content of a specific topic type and ignores content unrelated to the theme. The strategy combines the genetic algorithm with a content-based vector space model: the global-optimization property of the genetic algorithm ensures the completeness of the crawl, the relationships between web pages determine the importance of each page, and the vector space model identifies relevance to the theme. After this modification, tests and comparisons using fixed keywords show that both the total number of pages and the number of on-topic pages increased, and the proportion of on-topic pages rose by about 30%, improving the accuracy of the information the system returns.
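
The abstract does not specify the hash algorithm itself, so the following is only a minimal sketch of the general idea: mapping each URL string to a fixed-size key so the crawler queue can check for already-seen URLs in constant time. The names `url_key` and `UrlFrontier` are hypothetical, and using the first 64 bits of SHA-1 as the key is an assumption made for illustration, not the thesis's method or Heritrix's API.

```python
import hashlib
from collections import deque


def url_key(url: str) -> int:
    """Map a URL string to an integer key (assumption: first 64 bits of SHA-1)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class UrlFrontier:
    """Hypothetical crawl queue that skips URLs whose key was already seen."""

    def __init__(self):
        self._queue = deque()  # URLs waiting to be fetched, in FIFO order
        self._seen = set()     # integer keys of every URL ever enqueued

    def add(self, url: str) -> bool:
        """Enqueue url unless its key is already known; return True if added."""
        key = url_key(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._queue.append(url)
        return True

    def pop(self) -> str:
        """Remove and return the oldest queued URL."""
        return self._queue.popleft()

    def __len__(self) -> int:
        return len(self._queue)
```

Storing integer keys instead of full strings keeps membership checks cheap as the queue grows. A truncated digest is not strictly one-to-one: collisions are very unlikely at 64 bits but still possible, so a production crawler would need to decide how to handle them.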
Keywords/Search Tags: Search engine, Crawler queue, Hash algorithm, Genetic algorithm, Topic relevance
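
For the topic-relevance step, the abstract names a content-based vector space model but gives no formula or weighting scheme. As a hedged illustration only, the sketch below scores a page against a fixed set of topic keywords using cosine similarity over raw term counts. The function names, the tokenizer, and the `threshold` value are all assumptions; the genetic-algorithm part of the strategy is not covered here.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Lowercase the text and count word occurrences (raw term frequency)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine between two sparse term vectors; 0.0 if either vector is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_on_topic(page_text: str, topic_text: str, threshold: float = 0.2) -> bool:
    """Hypothetical filter: keep a page if its similarity reaches the threshold."""
    score = cosine_similarity(term_vector(page_text), term_vector(topic_text))
    return score >= threshold
```

A crawler could call `is_on_topic` on each fetched page and follow links only from pages that pass. Real systems usually replace raw counts with a weighting such as TF-IDF and tune the threshold on sample data.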