Research And Implementation Of The Information Extraction In Retrieval System-Based Heritrix

Posted on:2015-08-16

Degree:Master

Type:Thesis

Country:China

Candidate:W Q Wang

GTID:2298330467962121

Subject:Electronics and Communications Engineering

Abstract/Summary:

With the rapid development of network technology, the Internet has become a carrier of large amounts of information. More and more people are accustomed to obtaining information through the network, and querying information on the Internet has become an indispensable part of people's lives. However, network information is highly complex, frequently updated, large in scale, and fast-growing. These characteristics make information extraction on the Internet a challenge in the development of network information technology. How to meet users' need to search for information on the network faster, more accurately, and more comprehensively has become a hot issue. The fastest and most effective way to address this problem is to study and optimize search engines. Information extraction is an important part of a search engine and directly influences its accuracy and comprehensiveness. Most search engine optimization work consists of optimizing and refining the information extraction part.

Based on a variety of relevant achievements in recent years, as well as the special needs of users, this paper designs and studies each module of information extraction, from the whole to the parts, and finally implements a web crawler for enterprise application.

The main work of this paper is as follows:

1. This paper makes a systematic study and comparison of different search engines. Using several important criteria, it surveys and compares several open source technologies. In addition, it introduces technologies related to web crawlers, mainly Heritrix, Java, and the basic functions of a web crawler.

2. This paper completes the design of the information retrieval system, focusing on the design of the web crawler and the parsing of information. The web crawler is built on the open source Heritrix crawler, which is highly extensible. Users can configure the URL seeds to be crawled, the formats of the files to be parsed, and the output files. Tika is used to parse the downloaded web page resources; Apache Tika wraps many parsing libraries, which makes it convenient to process files in different formats.

3. This paper focuses on the realization of an information extraction system on an enterprise platform. The system mainly includes URL injection, web page crawling, information parsing, duplicate page removal, and information storage. Based on the study of each part, the system is implemented.

As the experimental data show, the information extraction system based on Heritrix, Java, and Tika is effective. The system can complete information extraction within a limited time and provides reliable data support for crawler optimization.
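
The abstract lists duplicate page removal as one stage of the system but does not say how it is done. As an illustration only, the following is a minimal Java sketch of one common approach: hashing the normalized page text and skipping pages whose digest has already been seen. The class name `PageDeduplicator` and its methods are hypothetical. They are not taken from the thesis or from the Heritrix or Tika APIs, and the thesis may use a different technique.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of the thesis, Heritrix, or Tika.
class PageDeduplicator {
    // SHA-256 digests (hex) of every page body accepted so far.
    private final Set<String> seenDigests = new HashSet<>();

    /** Returns true the first time a page body is seen, false for repeats. */
    boolean isNew(String pageText) {
        return seenDigests.add(digest(normalize(pageText)));
    }

    // Trim and collapse whitespace runs so trivially reformatted copies hash identically.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Hex-encoded SHA-256 of the UTF-8 bytes of the text.
    static String digest(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PageDeduplicator dedup = new PageDeduplicator();
        System.out.println(dedup.isNew("Hello  world"));    // first sighting
        System.out.println(dedup.isNew(" Hello world\n"));  // same text after normalization
        System.out.println(dedup.isNew("Another page"));    // different content
    }
}
```

Exact-hash matching only catches pages whose normalized text is identical. Near-duplicate pages would need a similarity-based method, which this sketch does not attempt.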

Keywords/Search Tags:

Java, Heritrix, Tika, information extraction, web crawlers

Related items

1	The Design And Realization Of The Vertical Search Engine On The Basis Of Java
2	Design And Implementation Of Vertical News Search Engine Based On Heritrix
3	Research And Implementation Of Information Acquisition System Based On Heritrix
4	Research And Implementation Of The Vertical Search Engine System Based On JAVA With LUCENE And HERITRIX
5	The Study On Technology Of Information Collection Based On Web Crawler
6	Design And Implementation Of Digital Steganography Image Acquisition System Based On Web Crawler
7	The Study On Technology Of Website Information Collection Based On Web Crawler
8	A Web Crawler System For Professional-town Information Based On Heritrix Framework
9	Research Heritrix And Vertical Search Engine Based On Lucene
10	Research And Implementation Of Subject-oriented Vertical Search Engine On Basic Educational Resources