Research and Application of Techniques for a Lucene-Based Subject-Oriented Search System

Posted on: 2012-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: Z R Dai
GTID: 2178330335452134
Subject: Computer application technology
Abstract/Summary:
The explosive growth of information on the web has made search engines an indispensable tool. Google and Baidu are the most representative general-purpose search engines; they serve the general public and make it much easier to find useful information on the internet. This convenience, however, comes with low precision in the search results, for two reasons. First, the number of pages to crawl is growing exponentially, which degrades the engine's ability to pre-process those pages and leaves many duplicate pages in the results. Second, people live and work in different environments and therefore care about different kinds of information. A meteorologist wants the meteorology-related results of a query to appear on the first few pages; an agricultural worker wants the farming-related results there instead. Because people in different fields have different needs, a general-purpose search engine cannot meet the demands of professional users. Subject-oriented search engines emerged in this context.

The biggest difference between a subject-oriented search engine and a general-purpose web search engine is that the former extracts and then uses the structured information in web pages. This is useful because the resulting smaller pages make further processing, such as purification and the elimination of duplicate pages, much more convenient. Because the parts of a search engine are all linked to one another, better web pre-processing reduces the burden on indexing and yields more accurate results, which improves the user experience.

A subject-oriented search system can be divided into four main modules: data collection, web pre-processing, indexing, and search.
Among them, the web pre-processing module can be further divided into purification and the elimination of duplicate pages. A subject-oriented search engine involves many techniques; this thesis introduces the main ones and improves on several existing techniques to meet the requirements of a meteorology-oriented system. The main work is as follows.

First, the thesis introduces the workflow of the web crawler Heritrix and describes in detail how to use it to crawl web pages. A URL matching function is added so that the crawler fetches pages that better fit the subject.

Second, it introduces the HTML parsing tool HtmlParser and gives a complete algorithm for using HtmlParser to parse a page for the purpose of purification. Duplicate page elimination has always been an indispensable part of a search engine system. The thesis describes the purification algorithm in detail and introduces a fingerprint-matching elimination algorithm. It also lists and analyzes several common feature-string extraction approaches and points out their inadequate precision. The TextTiling segmentation algorithm is used to address this inadequacy, and Tongyici Cilin Expanded is added to identify synonyms, which are common in Chinese writing.

Third, it introduces the core technology, Lucene, which provides the search and indexing interfaces used in the subject-oriented search system. The thesis describes how Lucene works in detail and uses multi-threading to greatly improve indexing efficiency. It analyzes the ranking process and the formulas Lucene uses to rank results, and on the basis of the original ranking method constructs a new algorithm that places subject-related results on the first few pages.

Finally, building on this research, the thesis designs a meteorology-oriented search system. The system also includes some personalization features, such as hot-word recommendation and web page preview.
The hot-word recommendation algorithm uses the index files, a meteorology dictionary, and the search history to compute a score for each term, and then selects the top-scoring terms as hot words. Web page preview lets users see the content of a page without opening a new window.
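
The hot-word scoring idea can be illustrated with a minimal Python sketch. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the function name `hot_words`, the weights `w_index` and `w_history`, the `dict_boost` factor, and the use of normalized document frequency as the signal taken from the index. It is not the thesis's algorithm and it does not use the Lucene API.

```python
from collections import Counter


def hot_words(doc_freq, dictionary, history, top_n=3,
              w_index=0.4, w_history=0.6, dict_boost=2.0):
    """Rank terms by a weighted mix of index and search-history frequency.

    doc_freq   -- term -> number of indexed documents containing it
    dictionary -- set of domain (e.g. meteorology) terms
    history    -- list of past search terms, one entry per search
    The weights and the boost are hypothetical; the source only says that
    these three inputs are combined into a per-term score.
    """
    query_freq = Counter(history)
    # Normalize both signals to [0, 1]; fall back to 1 to avoid dividing by 0.
    max_df = max(doc_freq.values(), default=0) or 1
    max_qf = max(query_freq.values(), default=0) or 1

    scores = {}
    for term in set(doc_freq) | set(query_freq):
        score = (w_index * doc_freq.get(term, 0) / max_df
                 + w_history * query_freq.get(term, 0) / max_qf)
        if term in dictionary:
            score *= dict_boost  # favor terms from the subject dictionary
        scores[term] = score

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


if __name__ == "__main__":
    doc_freq = {"typhoon": 50, "rainfall": 100, "football": 80}
    dictionary = {"typhoon", "rainfall"}
    history = ["typhoon", "typhoon", "football", "rainfall"]
    print(hot_words(doc_freq, dictionary, history, top_n=2))
```

Multiplying by a dictionary boost, rather than filtering on the dictionary, keeps frequently searched off-topic terms eligible while still ranking subject terms first; a real system might prefer a hard filter.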
Keywords/Search Tags: subject-oriented search system, Lucene, TextTiling, purification, duplicate web pages, elimination