
The Research And Application Of Integrated Risk Search Engine Technology

Posted on: 2009-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: Q Chang
Full Text: PDF
GTID: 2178360242988571
Subject: Computer application technology
Abstract/Summary:
A theme search engine is a specialized search engine for a particular industry. It subdivides and extends general-purpose search by integrating one category of specialized information from across websites. Its key technologies include Chinese word segmentation, theme crawling, indexing, and distributed storage; this thesis focuses on crawling and indexing. The research covers the following:
1. A method is proposed that combines page content and link analysis to calculate theme relevance. Evaluation algorithms based on hyperlinks consider the link structure between pages but overlook how relevant the pages are to the theme, so the search deviates from the theme ("theme drift"). Content-based evaluation algorithms consider only how important the text is to the subject and ignore the link structure of the Web. A combined search strategy uses content-based evaluation to keep results related to the topic and link-structure evaluation to broaden the coverage of the resources searched.
2. The Shark-search algorithm is improved in its URL queue maintenance and retrieval time, which improves the time and space efficiency of the algorithm. The vector space model is used to calculate theme similarity. To judge the relevance of page content to the topic, we apply the term-based vector space model that is widely used in the field of text classification. To judge the relevance of a URL to the topic, we apply a strategy based on page content and web structure, and develop the PageRank hyperlink analysis method further with respect to the time performance of the web site.
3. This thesis presents the design of a risk-theme index system based on character inverted lists. We adopt an inverted index organized around a classification table to improve the efficiency of index creation.
We design batch and incremental indexing methods so that the document index can be updated dynamically. Combining this construction algorithm with Nutch, we build and maintain the index component of a risk-theme search engine.
4. This thesis designs and implements a risk-theme search engine based on the open-source project Nutch. By comparing the query results of our search engine with those of existing site searches, we show that the system can provide users with comprehensive risk-theme information services.
This research is supported by the key national science and technology project of the "11th Five-Year" plan, "Key technology research and demonstration of Integrated Risk Guardians" (No. 2006BAD20B02).
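The abstract does not give the actual scoring formula, so the following is only a minimal sketch of the general idea in items 1 and 2: a term-based vector space model supplies a content similarity, which is then mixed with a link-based score. Whitespace tokenization stands in for Chinese word segmentation, and `combined_score`, `link_score`, and the weight `alpha` are hypothetical names and values chosen for illustration, not taken from the thesis.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_score(content_sim, link_score, alpha=0.7):
    """Weighted mix of content similarity and a link-based score in [0, 1].

    alpha is a hypothetical tuning weight, not a value from the thesis.
    """
    return alpha * content_sim + (1 - alpha) * link_score


# Toy topic description and two candidate pages (made-up text).
topic = term_vector("flood risk earthquake risk disaster warning")
on_topic = term_vector("earthquake warning issued as flood risk rises")
off_topic = term_vector("football league results and transfer news")

sim_on = cosine_similarity(topic, on_topic)
sim_off = cosine_similarity(topic, off_topic)
```

With equal link scores, the on-topic page ranks above the off-topic one because only its content similarity is non-zero. A real crawler would likely use weighted terms (for example TF-IDF) and a link score such as a normalized PageRank value in place of the placeholder argument.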
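Item 3 mentions an inverted index that supports both batch construction and incremental updates. The sketch below illustrates that distinction with a toy in-memory structure. It is word-based rather than character-based, it does not reproduce the classification-table layout or the Nutch integration described in the thesis, and the class and method names are assumptions made for this example.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_terms = {}  # doc_id -> set of terms, used to undo old postings

    def add_document(self, doc_id, text):
        """Incremental update: insert one document, replacing any old version."""
        self.remove_document(doc_id)
        terms = text.lower().split()
        for term in terms:
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1
        self.doc_terms[doc_id] = set(terms)

    def remove_document(self, doc_id):
        """Drop every posting that belongs to doc_id, if it was indexed."""
        for term in self.doc_terms.pop(doc_id, set()):
            del self.postings[term][doc_id]
            if not self.postings[term]:
                del self.postings[term]

    def build(self, docs):
        """Batch build: index a whole {doc_id: text} collection from scratch."""
        self.postings.clear()
        self.doc_terms.clear()
        for doc_id, text in docs.items():
            self.add_document(doc_id, text)

    def search(self, term):
        """Return the sorted ids of documents that contain the term."""
        return sorted(self.postings.get(term.lower(), {}))


index = InvertedIndex()
index.build({1: "flood risk report", 2: "earthquake risk map"})
index.add_document(3, "flood warning")          # a new page arrives
index.add_document(2, "earthquake hazard map")  # page 2 was re-crawled
```

Keeping the per-document term set lets an update remove stale postings without rescanning the whole index, which is the main point of an incremental path next to the batch one.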
Keywords/Search Tags: Risk theme search, Vector Space Model, PageRank algorithm, Inverted Table, Nutch