Research And Design Of Vertical Search Engine Web Crawler

Posted on:2016-01-02

Degree:Master

Type:Thesis

Country:China

Candidate:L Du

Full Text:PDF

GTID:2298330467991899

Subject:Information security

Abstract/Summary:

In recent years, the Internet and its related technologies and products have developed rapidly and grown more sophisticated, creating an open, global resource that holds a huge volume of information stored as pages of text, music, pictures, and more. Faced with this mass of information, it is difficult to extract useful information quickly and accurately with a traditional search engine. The vertical search engine emerged to solve this problem. Topic-specific search engines can improve the accuracy, depth, and breadth of queries, greatly improving efficiency in people's work and life. This paper first analyzes the requirements of a vertical search engine, then studies and designs in detail the technologies it involves, and implements some of the functional modules in code. Finally, a blog-oriented vertical search engine is designed and implemented. The main work of this paper includes the following aspects:

1) Several modules of a vertical search engine were researched and coded. A structured web page extraction module was written based on HTML structure and a probability model. Building on the open-source Jieba ("stuttering") word segmenter, four Chinese segmentation methods were implemented: Maximum Probability, Hidden Markov Model, MixSegment, and MixSegment with UserDict. The URL de-duplication module takes a new approach: instead of the classic Bloom filter algorithm, it uses a scheme in which each URL occupies 1 bit of memory and the algorithm's complexity is O(n). Although the overall memory use is higher, the accuracy can reach 100%. Finally, the construction of the inverted index was analyzed and researched.

2) The blog vertical search engine was designed and implemented in detail. For crawling, the crawler starts from a set of high-quality web pages and uses a probabilistic method to select a high-quality link as the next address to fetch. An improved vector space model is used to judge whether a page is on topic. An original method is proposed for finding a blog's feed address: candidate addresses are scored mathematically, with points deducted whenever noise is present, so the candidate with the highest score is taken as the real feed address. An RSS parsing module was also programmed. A sorting algorithm for blogs was designed by giving different weights to HITS, PageRank, and blog update status. The SimHash algorithm is used to remove duplicate web pages, and the Hamming distance comparison module was implemented with a time complexity of O(n). Because MySQL's concurrency is not high enough, MongoDB was chosen instead. A cache system was also designed to improve the speed of user queries and access.
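
As an illustration of the duplicate-removal step described above, the following is a minimal sketch of SimHash fingerprinting with a Hamming distance check. It is not the implementation from the thesis: the simple word tokenizer, the MD5-based token hash, the 64-bit fingerprint width, and the 3-bit threshold are all assumptions chosen for this example, and the function names are hypothetical.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed fingerprint width, not stated in the abstract


def tokenize(text):
    """Split text into lowercase word tokens (a stand-in for real segmentation)."""
    return re.findall(r"\w+", text.lower())


def simhash(text):
    """Compute a 64-bit SimHash fingerprint with every token weighted equally."""
    weights = [0] * FINGERPRINT_BITS
    for token in tokenize(text):
        # Assumption: take the first 8 bytes of MD5 as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(FINGERPRINT_BITS):
            # Each token votes +1 or -1 on every bit position.
            if (token_hash >> i) & 1:
                weights[i] += 1
            else:
                weights[i] -= 1
    fingerprint = 0
    for i in range(FINGERPRINT_BITS):
        # A bit is set when the positive votes outnumber the negative ones.
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, threshold=3):
    """Treat two texts as duplicates if their fingerprints differ in few bits."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= threshold
```

Two identical texts always produce the same fingerprint, so their Hamming distance is 0. Texts that share most of their tokens tend to differ in only a few bits, which is why a small threshold can serve as a near-duplicate test. A real system would likely weight tokens by importance instead of counting each one equally.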

Keywords/Search Tags:

Vertical search engine, Web crawler, Blog, PageRank, Duplicate removal algorithm

Related items

1	Research And Implementation Of Tax Vertical Search Engine And Improved PageRank Algorithm
2	Design And Implementation Of A Distributed Vertical Search Engine For Blog
3	Research And Implementation On Removing Duplicated WebPages Algorithm Of Search Engine
4	Research On XML-based Index And Page Rank Technology In Vertical Search
5	Research And Implementation On Focused Crawler With New Strategy For The Vertical Search Engine
6	A Vertical Search Engine In The Field Of News
7	Research Of Main Technologies Of Vertical Search Engine
8	Vertical Search Engine Research, And Implementation
9	The Design And Implementation Of Vertical Search Engine For Position Query
10	The Research And Design Of The Vertical Search Engine For The Family Medicine In Common Use