The Research And Implementation Of A Small Theme Search Engine Based On Lucene And Heritrix

Posted on: 2016-07-25; Degree: Master; Type: Thesis
Country: China; Candidate: S Gu
GTID: 2308330482953339; Subject: Software engineering
Abstract/Summary:
With the rapid growth of the Internet in recent years, the information available on the network has become increasingly complex. Users can no longer find the information they need unaided, and demand for information search methods keeps growing. However, the results returned by current general-purpose search engines are often mixed with much irrelevant information, so users have begun to look for more accurate, topic-specific search engines. Research on topic-specific search engine technology is therefore necessary.

This thesis analyzes the main modules of search engines and introduces the background of search engine architecture. First, the structure of a topic-specific search engine is split into two modules: data collection and processing, and data searching. Second, the data collection module is studied through an analysis of the Heritrix source code and architecture, including URL parsing and allocation and the implementation of multi-threading. Next, the drawbacks of Heritrix for topic-specific crawling are analyzed and some improvements are made. Two problems are solved: restricting URL parsing to the targeted pages only, and the crawler's multi-thread mechanism being effectively disabled when crawling a single website. Methods that use regular expressions to preprocess the collected data are given. The data searching module is implemented with the Lucene information retrieval toolkit. Then, according to the requirements of topic-specific search, a mechanism is customized for further sorting and filtering of the returned search results. In addition, because Lucene lacks adequate Chinese support, some optimizations for Chinese language support are added.
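The regular-expression preprocessing of crawled pages mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation: the patterns, the function name `extract_text`, and the choice of Python (rather than the Java used by Heritrix and Lucene) are all assumptions made for illustration. The idea is simply to drop scripts, styles, comments, and tags so that only indexable text remains.

```python
import html
import re

# Hypothetical patterns, chosen for illustration only.
SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>",
                          re.IGNORECASE | re.DOTALL)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def extract_text(page: str) -> str:
    """Reduce a crawled HTML page to plain text suitable for indexing."""
    page = SCRIPT_STYLE.sub(" ", page)   # remove script/style blocks with their bodies
    page = COMMENT.sub(" ", page)        # remove HTML comments
    page = TAG.sub(" ", page)            # remove the remaining tags
    page = html.unescape(page)           # decode entities such as &amp;
    return WHITESPACE.sub(" ", page).strip()


sample = ("<html><head><style>p{color:red}</style></head><body>"
          "<h1>Title</h1><p>Hello &amp; welcome</p>"
          "<script>var x = 1;</script></body></html>")
print(extract_text(sample))
```

Regular expressions are not a full HTML parser, so a sketch like this is only reasonable for pages from a single, known site whose markup is regular, which matches the single-website setting described in the abstract.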
In addition, throughout the analysis and implementation, a specific programming language is used to illustrate points to note when implementing in that language.

Finally, a search engine that collects and searches the prose-category content of a website is implemented. The search functions are tested and compared against other search methods. The final search results show that accurate results are obtained, and multi-threaded programming is used to increase processing speed.

The research also has some shortcomings. For example, there is no distributed mechanism for search, and the user interface of the search engine is not optimized and not friendly enough to users. We will consider using Solr and DWR to build a friendlier user interface. DWR is an Ajax encapsulation framework that makes browser interaction convenient to implement. The search engine also does not yet use a better dictionary-based method for Chinese word segmentation, because such methods require a great deal of manual data analysis and statistics. We will build a classified thesaurus of our own based on an appropriate dictionary library. Lastly, when search results are displayed, showing the text surrounding the matched keywords would be more informative than showing only the first line of the text as the brief introduction.
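The last improvement proposed above, showing the text around the matched keyword instead of the first line, can be sketched as follows. This is an illustrative assumption, not something the thesis implements: the function name `snippet`, the `radius` parameter, and the fallback behaviour are hypothetical.

```python
import re


def snippet(text: str, keyword: str, radius: int = 30) -> str:
    """Return the text surrounding the first case-insensitive match of keyword."""
    match = re.search(re.escape(keyword), text, re.IGNORECASE)
    if match is None:
        # Fallback: no match, so show the leading characters of the text.
        return text[: 2 * radius]
    start = max(0, match.start() - radius)
    end = min(len(text), match.end() + radius)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


print(snippet("abcdefghij KEY klmnopqrst", "key", radius=3))
```

The window is measured in characters rather than words, so the same sketch applies to Chinese text, where words are not separated by spaces.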
Keywords/Search Tags: Lucene, multi-thread, Regular Expression, Heritrix, Search Engine