Website Crawler And Retrieval System Based On Lucene

Posted on:2009-11-18

Degree:Master

Type:Thesis

Country:China

Candidate:Q Xi

Full Text:PDF

GTID:2178360308978322

Subject:Computer software and theory

Abstract/Summary:

With the growth of networks and the expansion of Web resources, obtaining the required information through full-text Web information retrieval systems has become an important part of everyday life, and users are increasingly concerned with finding information more accurately and efficiently. This thesis introduces Web information retrieval systems and the related theories and techniques, and carries out in-depth practical work to show how information can be obtained from such a system. The second chapter introduces the relevant theory, including the types of search engine, Chinese word segmentation methods, the inverted index, and Lucene. The third chapter proposes two page-template analysis algorithms based on the features of web sites. The first is based on the longest common sequence model and uses dynamic programming to obtain an optimal solution; the original algorithm is optimized and extended so that it yields both the site's template string and the other strings inserted into it. The second uses statistical theory to build a mathematical model: it extracts the start and end positions of the common sequence and, because the content differs in length from page to page, these positions show different variances, which makes it possible to extract the useful content from a web site. Extracting only the useful content not only saves space but also reduces indexing and search time. The chapter closes by comparing the advantages and shortcomings of the two algorithms. The fourth chapter introduces a Java web spider, including heterogeneous data processing, such as extracting content from Word, PDF, RTF, and other formats, followed by a new method for parsing HTML documents and the use of multi-threading.
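
The first template algorithm can be illustrated with a minimal sketch. It assumes that the "longest common sequence" model corresponds to the classic longest common subsequence problem and that pages are compared as lists of tokens; the class name `TemplateLcs`, its methods, and the sample tokens are hypothetical, and the thesis's optimized and extended version is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: not the thesis's implementation.
class TemplateLcs {

    // Classic O(n*m) dynamic-programming LCS over two token lists.
    // The result approximates the template shared by both pages.
    static List<String> lcs(List<String> a, List<String> b) {
        int n = a.size(), m = b.size();
        // len[i][j] = LCS length of a[i..] and b[j..]
        int[][] len = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                if (a.get(i).equals(b.get(j))) {
                    len[i][j] = len[i + 1][j + 1] + 1;
                } else {
                    len[i][j] = Math.max(len[i + 1][j], len[i][j + 1]);
                }
            }
        }
        // Walk the table from the start to rebuild one optimal subsequence.
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (a.get(i).equals(b.get(j))) {
                out.add(a.get(i));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    // Returns the tokens of a page that are not matched, in order,
    // against the template: a rough stand-in for the inserted content.
    static List<String> content(List<String> page, List<String> template) {
        List<String> out = new ArrayList<>();
        int t = 0;
        for (String tok : page) {
            if (t < template.size() && tok.equals(template.get(t))) {
                t++; // token belongs to the template
            } else {
                out.add(tok); // token is page-specific content
            }
        }
        return out;
    }
}
```

Calling `lcs` on two tokenized pages from the same site gives the shared markup, and `content` then returns what each page inserts into it. The greedy matching in `content` is a simplification: a content token that happens to equal the next template token is treated as template.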
The fifth chapter introduces a Web page crawling system that automatically downloads information from the Internet, including page content, pictures, and next pages; Lucene is added to index this information in order to improve the efficiency and speed of retrieval. In summary, this thesis designs and analyses two algorithms and uses a range of software and programming languages, including Oracle, Tomcat, JSP, Java, Eclipse, and Lucene. More importantly, it contributes an HTML parsing method and integrates Lucene into the system to reduce retrieval time and improve search efficiency.
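
To make the role of the inverted index concrete, the following is a minimal sketch of the data structure in plain Java. It is not Lucene's API and not the system described in the thesis; the class name `TinyInvertedIndex`, its methods, and the sample documents are hypothetical, and tokenization is simplified to splitting on non-word characters, which would not be adequate for Chinese text without a word segmentation step.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: maps each term to the ids of the documents containing it.
class TinyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Tokenizes the text naively and records docId under every term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of documents that contain every term of the query (AND semantics).
    Set<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs); // copy so the index is not modified
            } else {
                result.retainAll(docs); // intersect posting lists
            }
        }
        if (result == null) {
            return Collections.emptySet();
        }
        return result;
    }
}
```

A lookup touches only the posting lists of the query terms rather than scanning every document, which is the reason an index of this kind shortens retrieval time. A production library such as Lucene adds analysis, scoring, and on-disk storage on top of the same idea.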

Keywords/Search Tags:

Common sequence, Web spider, Lucene, Inverted index, Full-text Retrieval

Related items

1	The Research Of Full-text Search Engine Key Technology Based On Lucene
2	The Research And Implementation Of Full-Text System Based On Lucene And Textual Image
3	The Research And Implementation Of Full-text Retrieval System Based On Lucene
4	A Research Of Full-Text Retrieval Based On Inverted Index
5	Design And Implementation Of Heterogeneous Document Library Full-text Retrieval System
6	Research On Full-text Retrieval Technology In Education Resource Sharing System
7	Military Retrieval System Design And Implementation
8	Based On Research And Optimization Lucene Inverted Index Performance
9	Research On Full-Text Retrieval Technology For XML Documents Based On Inverted Index
10	The Research Of Full-Text Retrieval And Its Relative Security Technology For Chinese