
Research And Design Of Vertical Search Engine Web Crawler

Posted on: 2016-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: L Du
Full Text: PDF
GTID: 2298330467991899
Subject: Information security
Abstract/Summary:
In recent years, the Internet and its related technologies and products have developed rapidly and grown more sophisticated, producing an open, global resource that holds an enormous amount of information stored as pages of text, music, pictures and more. Faced with this mass of information, it is difficult to extract useful information quickly and accurately with a traditional search engine. Vertical search engines emerged to address this problem. Topic-specific search engines can improve the accuracy, depth and breadth of queries, greatly improving efficiency in people's work and life. This thesis first analyzes the requirements of a vertical search engine, then studies and designs in detail the technologies involved and implements some of the functional modules in code. Finally, a blog-oriented vertical search engine is designed and implemented.

The main work of this thesis includes the following aspects:

1) Research and implementation of several vertical search engine modules

A structured web page extraction module was written based on HTML structure and a probability model. Building on the open-source Jieba ("stutter") word segmenter, four Chinese word segmentation methods were implemented: Maximum Probability, Hidden Markov Model, MixSegment, and MixSegment with UserDict. The URL duplicate-detection module takes a novel approach: instead of the classic Bloom filter, it uses a scheme in which each URL occupies 1 bit of memory and the algorithm's complexity is O(n). Although the overall memory use is higher, the accuracy can reach 100%. Finally, the construction of the inverted index was analyzed and studied.

2) Detailed design and implementation of the blog vertical search engine

For crawling, starting from a set of high-quality web pages, a probabilistic method is used to select high-quality links as the next addresses to fetch. An improved vector space model is used to judge topic relevance. An original contribution is made in discovering blog feed addresses: a mathematical scoring method deducts points whenever a candidate contains noise, so the candidate with the highest score is taken as the real feed address. An RSS parsing module was also programmed. A ranking algorithm for blogs was designed by assigning different weights to HITS, PageRank and blog updates. The SimHash algorithm is used to remove duplicate web pages, and the Hamming distance judgment module is implemented with O(n) time complexity. Because MySQL's concurrency was not high enough, MongoDB was chosen. A cache system was designed to speed up user queries and access.
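
To make the SimHash and Hamming distance step concrete, the following is a minimal Python sketch of near-duplicate detection. It is not the thesis's code: the function names, the regex tokenizer, the use of MD5 as the per-token hash, the 64-bit fingerprint width and the threshold of 3 differing bits are all assumptions chosen for illustration.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width; the thesis does not state one


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (a stand-in for real segmentation)."""
    return re.findall(r"\w+", text.lower())


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint of the text."""
    weights = [0] * BITS
    for token in tokenize(text):
        # Hash each token to 64 bits (first 8 bytes of its MD5 digest).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes sum to a positive value.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Treat two pages as duplicates if their fingerprints differ in few bits."""
    return hamming_distance(simhash(a), simhash(b)) <= threshold
```

In this sketch every repeated token votes again, so frequent words weigh more. A crawler could store one fingerprint per fetched page and compare each new page's fingerprint against the stored ones before indexing it.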
Keywords/Search Tags: Vertical search engine, Web crawler, Blog, PageRank, Duplicate removal algorithm