
Areas Of The Theme-based Web Information Retrieval Techniques

Posted on: 2007-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: X A Li
Full Text: PDF
GTID: 2208360185982274
Subject: Computer software and theory
Abstract/Summary:
With the development of network technology, network resources have grown exponentially, and finding information of interest on the Internet has become a necessity. Search engines such as Google and Baidu meet users' requirements to some extent. Built on traditional full-text retrieval technology, current general-purpose web search engines focus mainly on crawling web page data quickly and completely, indexing and storing large-scale data, ranking search results by relevance, millisecond-level response time, distributed processing and load balancing, natural language processing, and so on.

However, it is difficult for a general-purpose search engine to gather information on every domain and topic. Even when the information is fully gathered, the breadth of domains and topics makes it difficult to be accurate and specialized, so search results contain much useless information. Domain- and topic-specific search engines address these problems by providing web information retrieval services only for a specific topic or domain. They have won their own place in the search engine family, and users are highly satisfied with their relevance and accuracy.

This thesis studies the technologies behind domain- and topic-specific search engines. It focuses on web page crawling and processing, Chinese word segmentation, text categorization, web page ranking, indexing, and search.

The contributions of the thesis are:
(1) A block-based algorithm for extracting the main content block of a web page is studied and proposed. It does not require complicated machine learning methods and is fast. Experiments show that the algorithm achieves high accuracy and recall for content blocks. The algorithm can also be applied to avoid storing non-primary content blocks of web pages repeatedly, which saves external storage space.
(2) An algorithm for text categorization is proposed and implemented, based on...
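
The abstract does not give the details of the block-based extraction algorithm, so the following is only a minimal sketch of one common heuristic in that spirit, not the thesis's method. It assumes a page can be split into flat blocks at block-level tags, and that the main content block is the one with the most text and the lowest share of link text. The names `BlockCollector`, `extract_main_block`, and `BLOCK_TAGS`, as well as the 0.5 link-density threshold, are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate blocks; the thesis may use another set.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}


class BlockCollector(HTMLParser):
    """Splits a page into flat text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters seen inside <a> in this block
        self._in_link = 0      # nesting depth of <a>
        self._skip = 0         # nesting depth of <script>/<style>

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = " ".join(self._text)
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip or not stripped:
            return
        self._text.append(stripped)
        if self._in_link:
            self._link_chars += len(stripped)


def extract_main_block(html, max_link_density=0.5):
    """Return the text of the highest-scoring block, or "" if none qualifies."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # store any trailing text after the last block tag

    best, best_score = "", 0.0
    for text, link_chars in parser.blocks:
        density = link_chars / len(text)
        if density > max_link_density:
            continue  # mostly links: likely navigation or footer
        score = len(text) * (1.0 - density)
        if score > best_score:
            best, best_score = text, score
    return best
```

Calling `extract_main_block(html)` on a page with a navigation bar, an article body, and a footer would be expected to return the article text, because link-heavy blocks are discarded and short blocks score low. The heuristic needs no training data, which matches the abstract's point about avoiding complicated machine learning, but its tag set and threshold would have to be tuned on real pages.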
Keywords/Search Tags:information retrieval, search engine, word segmentation, text categorization, decision tree