Research And Implementation Of A Topic-based Search Engine

Posted on: 2008-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: S G Fu
Full Text: PDF
GTID: 2178360242466127
Subject: Computer application technology
Abstract/Summary:
Because Web information changes constantly, it is becoming increasingly difficult for a general search engine to provide users with a high-quality, comprehensive, and timely search service. The fundamental limitation is that such an engine attempts to index all Web information and to serve queries on every topic. In contrast, a topic-based search engine covers only Web information related to a specific topic, so its content can be deeper and its update cycle shorter, and it can meet the demand for fast and accurate access to information resources. Topic-based Web search engines are currently an active area of research and development in computer science and the information industry.

This thesis first describes the current state of search engine development and briefly analyzes the advantages and disadvantages of existing search engines. It then designs the overall architecture and the individual modules of a topic-based search engine, drawing on general search engine technology and the specific characteristics of topic-based search. Three chapters cover the analysis, design, and implementation of the three major modules: a rule-based Chinese word segmentation algorithm, a topic-based crawling module, and a Web information indexing and storage module. The rule-based Chinese word segmentation algorithm combines a dictionary, part-of-speech information, word frequency information, an improved traditional word segmentation algorithm, and Chinese grammar rules to achieve high accuracy. Beyond a general crawling function, the topic-based crawling module also implements a dynamic Web crawling sub-module to address the problems introduced by Web 2.0 technology. The Web information indexing and storage module uses a B+ tree to index Web documents. To achieve better scalability and high efficiency in indexing and storing both Chinese and English Web content, this module integrates CLucene by modifying, extending, and recompiling its source code. Finally, the thesis discusses future work on topic-based search engines and the technologies that need further study.
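
The abstract does not say which traditional word segmentation algorithm the thesis improves on. As an illustration only, the sketch below shows forward maximum matching, one common dictionary-based baseline. The function name, the toy dictionary, and the `max_len` parameter are assumptions made for this example and are not taken from the thesis; part-of-speech, word frequency, and grammar rules are left out.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word.

    Hypothetical helper for illustration; not the thesis's algorithm.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary (assumed); a real system would load a large lexicon.
toy_dictionary = {"搜索", "引擎", "搜索引擎", "主题", "技术"}
segments = forward_max_match("主题搜索引擎技术", toy_dictionary)
print(segments)
```

Purely greedy matching can pick the wrong boundary when words overlap, which is the kind of ambiguity that additional signals such as word frequency or grammar rules are typically used to resolve.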
Keywords/Search Tags: Topic-based search engine, Chinese word segmentation, Web crawler, B+ tree index, CLucene