
Study Of Vertical Search Engine

Posted on: 2012-10-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Wang
GTID: 1118330371465407
Subject: Computer software and theory
Abstract/Summary:
With the explosive growth of Web data, searching and mining professional Web data, such as financial data, academic data, military information, and sports information, has become an emerging hot issue. Traditional general-purpose search engines such as Google or Yahoo are not designed for specialized domains of Web data, so they cannot give users highly accurate results in a properly ranked list or in versatile forms and graphics. Vertical search engines now play a vital role in professional searching and mining of domain data.

This thesis focuses on the crawling, data integration, efficient hierarchy, and applications of vertical search engines. Our contributions are as follows:

1. A novel topic-focused Web crawling model is designed. Unlike general search engines, vertical search engines are interested only in the data and knowledge of a specific professional domain. Topic-focused Web crawling fetches HTML documents from the Internet selectively, on the basis of a theme or topic defined in advance. We propose a new algorithm that predicts the relevance between a URL and the predefined theme from the anchor text and the topic of the Web page that contains the URL. On this basis, we present a novel Web crawling model that fetches topic-related Web pages accurately and intelligently.

2. A new model for integrating unstructured "Deep Web" data is proposed. Deep Web data on dynamic Web pages come from structured databases with strict schemas. From the Web crawler's perspective, however, these structures and schemas are invisible, and the HTML documents are semi-structured or unstructured. This thesis describes a novel model that integrates deep Web data based on the DOM (Document Object Model) tree and a crawler broker, with which we can build the structured data basis of a vertical search engine.

3. An efficient hierarchy for vertical search engines is studied. Because it must handle vast amounts of professional Web data and concurrent user queries, a vertical search engine has to work very efficiently. Many measures are taken to optimize and stabilize the topic-focused Web crawler, including concurrent and distributed crawling based on Hadoop, URL assignment management, robots exclusion, DNS optimization, and incremental indexing. In addition, we introduce a cache system that helps the vertical search engine handle user queries more quickly and efficiently.

4. An academic search engine prototype, Dolphin, is implemented. After collecting a large amount of professional data from a particular field or domain, vertical search engines can provide users with more precise and valuable results. This thesis presents an academic search engine as an example of a vertical search engine; it implements multiple functions such as topic clustering, time series analysis, and citation analysis. It also provides a refined ranked list based on a new ranking algorithm.

In this systematic study of vertical search engines, we develop novel algorithms for collecting and integrating Web data, optimize the hierarchy and modules of the vertical search engine, and implement an academic search engine prototype. The algorithms and models proposed in this thesis can also be applied to many other kinds of vertical search engines that focus on different fields or domains.
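
The abstract does not give the relevance-prediction algorithm itself, so the following is only a minimal sketch of the general idea behind contribution 1: score an unvisited URL from its anchor text and from the topic relevance of the page that links to it, then crawl the highest-scoring URLs first. The topic term set, the term-overlap measure, the 0.6 anchor weight, and the names `term_overlap`, `link_score`, and `Frontier` are all assumptions made for illustration, not the method used in the thesis.

```python
import heapq
import re

# Hypothetical topic definition: a small set of terms describing the target domain.
TOPIC_TERMS = {"search", "engine", "crawler", "ranking", "index"}


def term_overlap(text, topic_terms):
    """Return the fraction of tokens in `text` that are topic terms (0.0 if empty)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in topic_terms for token in tokens) / len(tokens)


def link_score(anchor_text, parent_text, topic_terms=TOPIC_TERMS, anchor_weight=0.6):
    """Estimate the relevance of a link before fetching it.

    Combines the relevance of the anchor text with the relevance of the
    page containing the link. The weighting is an illustrative assumption.
    """
    anchor = term_overlap(anchor_text, topic_terms)
    parent = term_overlap(parent_text, topic_terms)
    return anchor_weight * anchor + (1.0 - anchor_weight) * parent


class Frontier:
    """Crawl frontier that returns the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

A crawler built this way would call `link_score` for every link extracted from a fetched page, push the result into the `Frontier`, and pop the next URL to fetch, so that links with on-topic anchor text found on on-topic pages are visited before unrelated ones.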
Keywords/Search Tags:Vertical Search Engine, academic search, deep web, XML, ranking function