
Research And Implementation Of Data Acquisition Technology Of Vertical Search Engine

Posted on: 2015-12-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Chen
GTID: 2348330518473266
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the vertical search engine has appeared on the Internet as a new kind of search engine. Against the background of the explosion of Internet data in the information age, it is a product of people's pursuit of a better search experience. Generally, a vertical search engine covers only one domain and is designed to provide a more refined, deeper search service than a full-text search engine. Overall, a vertical search engine system includes three main modules: the structured data acquisition module, the index module and the retrieval module. This thesis introduces the theory and structure of the system and then focuses on the acquisition of structured data. Data acquisition is the foundation of the entire system, and structured data is the final form of data that a vertical search engine needs. In this thesis, data acquisition for a vertical search engine is divided into two stages: web page collection and information extraction.

For web page collection, the author proposes the concept of a theme spider and, after in-depth study, designs and implements a video theme spider. The spider supports different crawling strategies, including a strategy based on a tree model and a strategy based on label combination. With these strategies, the spider is able to fetch all of the web pages of a theme website. To give the spider the ability to perform incremental updates, the author adds a database interaction module. With the widespread adoption of new web development technologies, more and more theme websites use JavaScript to display data, and traditional spiders can hardly handle this situation well. The author studies and implements two methods of dynamic data acquisition: manual analysis of the JavaScript source code by developers, and embedding a browser kernel. Experiments verify that the spider is able to capture dynamic data.

For web information extraction, the author implements extraction from a large number of theme web pages using XPath analysis. During the experiments, however, the author found that theme websites often change their page structure, and that a vertical search engine also has to cover new theme websites over time. In both situations, developers must repeat a lot of the same work. The author therefore improves the original method and implements automatic generation of XPath templates. The last part of the thesis summarizes the work and gives an outlook.
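To make the XPath-based extraction step concrete, the following is a minimal sketch of how a per-site XPath template might turn a collected page into a structured record. It is not the thesis's implementation: the field names, the XPath expressions, the sample page, and the helper `extract_record` are all hypothetical, and Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath and requires well-formed markup) stands in for whatever parser the author used.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath template for one video site: field name -> XPath.
# ElementTree supports only a subset of XPath, which is enough for this sketch.
VIDEO_TEMPLATE = {
    "title": ".//h1[@class='title']",
    "director": ".//span[@class='director']",
    "year": ".//span[@class='year']",
}


def extract_record(page_source, template):
    """Apply each XPath in the template and return one structured record."""
    root = ET.fromstring(page_source)
    record = {}
    for field, xpath in template.items():
        node = root.find(xpath)
        # A missing node suggests the page layout changed and the template is stale.
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


# Made-up, well-formed sample page used only to exercise the sketch.
SAMPLE_PAGE = """
<html><body>
  <div id="info">
    <h1 class="title"> Example Film </h1>
    <span class="director">A. Director</span>
    <span class="year">2014</span>
  </div>
</body></html>
"""

record = extract_record(SAMPLE_PAGE, VIDEO_TEMPLATE)
print(record)
```

In this sketch, a field that comes back as `None` is a signal that the site's page structure no longer matches the template, which is the maintenance burden that motivates generating XPath templates automatically.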
Keywords/Search Tags: vertical search engine, theme spider, dynamic data, information extraction, structured data