
Website Crawler And Retrieval System Based On Lucene

Posted on: 2009-11-18; Degree: Master; Type: Thesis
Country: China; Candidate: Q Xi; Full Text: PDF
GTID: 2178360308978322; Subject: Computer software and theory
Abstract/Summary:
With the development of networks and the expansion of Web resources, obtaining the required information through full-text Web information retrieval systems has become an important part of everyday life, and users are increasingly concerned with finding information more accurately and efficiently. This thesis introduces Web information retrieval systems together with the related theories and techniques, and presents practical work showing how to obtain information with such a system. Chapter 2 introduces the relevant theory: the types of search engine, Chinese word segmentation methods, the inverted index, and Lucene. Chapter 3 proposes two page-template analysis algorithms based on the characteristics of websites. The first is based on the longest common subsequence model and uses dynamic programming to obtain an optimal solution; the original algorithm is optimized and extended so that it yields both the website's template string and the strings inserted into it. The second uses statistical theory to build a mathematical model that extracts the start and end positions of the common sequence; because the content differs in length from page to page, the resulting variances differ, which allows the useful content of a website to be extracted. This not only saves space but also reduces indexing and search time. The chapter closes by comparing the advantages and shortcomings of the two algorithms. Chapter 4 introduces a Java Web spider, including the processing of heterogeneous data, such as extracting content from Word, PDF, and RTF files, followed by a new method for analyzing HTML documents and the use of multi-threading.
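To make the first template-analysis idea concrete, the following is a minimal Java sketch and not the thesis's optimized algorithm. It assumes that two pages from the same site have already been split into tag and text tokens (the tokenization step is not shown), takes their longest common subsequence by dynamic programming as the shared template, and treats whatever is left over in a page as its content. The class and method names (`TemplateLcs`, `lcs`, `content`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: template = LCS of two token sequences.
class TemplateLcs {

    /** Longest common subsequence of two token lists (classic O(n*m) DP). */
    static List<String> lcs(List<String> a, List<String> b) {
        int n = a.size(), m = b.size();
        // len[i][j] = LCS length of the first i tokens of a and first j of b.
        int[][] len = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                } else {
                    len[i][j] = Math.max(len[i - 1][j], len[i][j - 1]);
                }
            }
        }
        // Walk back through the table to recover one optimal subsequence.
        List<String> out = new ArrayList<>();
        int i = n, j = m;
        while (i > 0 && j > 0) {
            if (a.get(i - 1).equals(b.get(j - 1))) {
                out.add(a.get(i - 1));
                i--;
                j--;
            } else if (len[i - 1][j] >= len[i][j - 1]) {
                i--;
            } else {
                j--;
            }
        }
        Collections.reverse(out);
        return out;
    }

    /** Tokens of a page that are not matched (greedily, in order) by the template. */
    static List<String> content(List<String> page, List<String> template) {
        List<String> out = new ArrayList<>();
        int t = 0; // index of the next template token to match
        for (String token : page) {
            if (t < template.size() && token.equals(template.get(t))) {
                t++;
            } else {
                out.add(token);
            }
        }
        return out;
    }
}
```

Indexing only the tokens returned by `content` rather than the whole page is what would save index space in such a scheme. With more than two sample pages, the template could be refined by taking the LCS of the current template with each further page.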
Chapter 5 introduces a Web page crawling system that automatically downloads information from the Internet, including content, pictures, and the next page; Lucene is added to index the information in order to improve the efficiency and speed of retrieval. In this thesis, I design and analyze two algorithms, using a range of software and programming languages, including Oracle, Tomcat, JSP, Java, Eclipse, and Lucene. More importantly, I create an HTML analysis method and integrate Lucene into the system to reduce retrieval time and improve search efficiency.
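The inverted index mentioned above is the data structure that makes full-text search fast: each term maps to the set of documents that contain it, so a query only touches the postings of its own terms. The sketch below is a toy in-memory version in plain Java, not Lucene's API or the thesis's implementation. The class name `TinyInvertedIndex` is hypothetical, and splitting on non-word characters is an assumption that only suits space-delimited text; Chinese text would first need a word segmentation step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration of an inverted index with AND queries.
class TinyInvertedIndex {
    // term -> sorted set of ids of the documents containing that term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Splits the text into lower-cased terms and records docId for each. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns ids of documents containing every query term, in ascending order. */
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs); // copy so postings stay intact
            } else {
                result.retainAll(docs);       // intersect with the next term
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }
}
```

A crawler would call `add` once per downloaded page, ideally on the extracted content rather than the full HTML, and `search` at query time. A production index such as Lucene's additionally stores term positions and frequencies for ranking and persists the index to disk.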
Keywords/Search Tags: Common sequence, Web spider, Lucene, Inverted index, Full-text Retrieval