Research And Implementation Of Data Acquisition Technology Of Vertical Search Engine

Posted on:2015-12-28

Degree:Master

Type:Thesis

Country:China

Candidate:Y Chen

Full Text:PDF

GTID:2348330518473266

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, vertical search engines have emerged on the Internet as a new kind of search engine. Against the background of the explosive growth of Internet data in the information age, they are a product of users' pursuit of a better search experience. A vertical search engine generally covers only one domain and is designed to provide a more refined, deeper search service than a full-text search engine. Overall, a vertical search engine system comprises three main modules: the structured data acquisition module, the index module, and the retrieval module. This thesis introduces the theory and structure of the system and then focuses on the acquisition of structured data. Data acquisition is the foundation of the entire system, and structured data is the final form of data that a vertical search engine requires.

In this thesis, vertical search engine data acquisition is divided into two stages: web page collection and information extraction. For web page collection, the author proposes the concept of a theme spider and, after in-depth study, designs and implements a video theme spider. The spider supports different crawling strategies, including one based on a tree model and one based on label combination. With these strategies, the spider is able to fetch all of the target web pages. To give the spider the ability to update incrementally, the author adds a database interaction module. With the widespread adoption of new web development technologies, more and more theme websites use JavaScript to display data, a situation that traditional spiders can hardly handle. The author studies and implements two methods of dynamic data acquisition: analysis of the JavaScript source code by developers, and browser kernel embedding. Experiments verify that the spider is able to capture dynamic data.

For web information extraction, the author implements extraction of information from a large number of theme web pages using the XPath analysis method. During the experiments, however, the author found that theme websites often change their page structure, and that a vertical search engine will also come to cover new theme websites. In both situations, developers have to repeat a great deal of the same work. The author therefore improves the earlier method and implements automatic generation of XPath templates. The final part of the thesis summarizes the work and gives an outlook.
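As a rough illustration of XPath-based extraction of structured data, the sketch below applies a hand-written template (field name mapped to a path) to a single page. It is not the author's implementation: the page fragment, the field names, the `TEMPLATE` dictionary, and the `extract` function are all hypothetical, and the sketch relies on the limited XPath subset of Python's standard `xml.etree.ElementTree`, which only parses well-formed markup. Real theme pages would likely need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a video theme page.
PAGE = """
<html>
  <body>
    <div class="video">
      <h1>Sample Title</h1>
      <span class="director">Jane Doe</span>
      <ul class="cast"><li>Actor A</li><li>Actor B</li></ul>
    </div>
  </body>
</html>
"""

# Hypothetical XPath template: one path per structured field.
# Only ElementTree's XPath subset is used here.
TEMPLATE = {
    "title": ".//div[@class='video']/h1",
    "director": ".//span[@class='director']",
    "cast": ".//ul[@class='cast']/li",
}


def extract(page_source, template):
    """Apply each path in the template and collect the matched text values."""
    root = ET.fromstring(page_source.strip())
    record = {}
    for field, path in template.items():
        # findall returns every matching element in document order.
        record[field] = [(node.text or "").strip() for node in root.findall(path)]
    return record


record = extract(PAGE, TEMPLATE)
print(record)
```

Keeping the paths in a separate template, rather than in the extraction code, means that a change in page structure or a newly covered website only requires a new template. That is the part the abstract describes generating automatically.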

Keywords/Search Tags:

vertical search engine, theme spider, dynamic data, information extraction, structured data

Related items

1	Research On Information Extraction Based On Vertical Search Engine
2	Research And Design On Key Technologies Of Vertical Search Engine Oriented Soybean Theme
3	The Research And Realization Of Tour Guide Information Vertical Search System
4	The Vertical Search Engine Research And Design
5	Vertical Search Engine Research And Implementation
6	The Key Technologies And Realization Of Vertical Search Engine For Expert Information
7	The Design And Realization Of The Vertical Search Engine On The Basis Of Java
8	Research And Achievement Of The Search Strategic For The Topic Search Engine Spider
9	The Vertical Search Engine For Campus Design And Implementation
10	Research And Implementation Of Vertical Search Engine Based On Characters Of Webpage Structure