Research And Implementation Of The Information Extraction In Retrieval System-Based Heritrix

Posted on: 2015-08-16  Degree: Master  Type: Thesis
Country: China  Candidate: W Q Wang  Full Text: PDF
GTID: 2298330467962121  Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid development of network technology, the Internet has become a carrier of large amounts of information. More and more people are accustomed to obtaining information through the network, and querying information on the Internet has become an indispensable part of people's lives. However, network information is highly complex, updated rapidly, large in scale, and fast-growing. These characteristics make information extraction on the Internet a challenge in the development of network information technology. How to meet users' need to search for information on the network faster, more accurately, and more comprehensively has become a hot issue. To solve this problem, the fastest and most effective approach is the research and optimization of search engines. Information extraction is an important part of a search engine and directly influences its accuracy and comprehensiveness. Most search engine optimization work consists of optimizing and refining the information extraction component.

Based on a variety of relevant achievements in recent years, as well as the special needs of users, this thesis designs and studies each module of information extraction, from the whole to the parts, and finally implements a web crawler for enterprise applications.

The main work of this thesis is as follows:

1. This thesis makes a systematic study and comparison of different search engines. According to several important criteria, it studies and comparatively analyzes several open source technologies. In addition, it introduces technologies related to web crawlers, mainly Heritrix, Java, and the basic functions of a web crawler.

2. This thesis completes the design of the information retrieval system, focusing on the design of the web crawler and the parsing of information. The web crawler is based on the open source Heritrix, which is highly extensible. Users can configure the URL seeds to be crawled, as well as the formats of the files to be parsed and of the output files. Apache Tika is used to parse the downloaded web page resources; it encapsulates many parsing packages, which makes it convenient to process files of different formats.

3. This thesis focuses on the realization of the information extraction system on an enterprise platform. The system mainly includes URL injection, web page crawling, information analysis, page duplicate removal, and information storage. Based on the study of these parts, the system is implemented.

As the experimental data show, the information extraction system based on Heritrix, Java, and Tika is effective. The system can complete information extraction within a limited time and provides reliable data support for crawler optimization.
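
The abstract lists page duplicate removal as one stage of the system but does not say how it is done. As a rough illustration only, the sketch below assumes one common approach: fingerprinting the extracted text of each page with SHA-256 and skipping any page whose fingerprint has already been seen. The class name `PageDeduplicator` and its methods are hypothetical and are not taken from the thesis, Heritrix, or Tika; only the Java standard library is used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (an assumption, not the thesis's implementation):
// detects pages whose extracted text has already been stored.
class PageDeduplicator {
    // Hex-encoded SHA-256 fingerprints of every page accepted so far.
    private final Set<String> seenFingerprints = new HashSet<>();

    // Trim, collapse runs of whitespace, and lowercase, so that trivial
    // formatting differences do not produce different fingerprints.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // SHA-256 digest of the normalized text, as 64 lowercase hex characters.
    static String fingerprint(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalize(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every conforming Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns true and records the page if its text is new;
    // returns false if an equivalent page was already recorded.
    boolean isNew(String extractedText) {
        return seenFingerprints.add(fingerprint(extractedText));
    }

    public static void main(String[] args) {
        PageDeduplicator dedup = new PageDeduplicator();
        String[] pages = {"Company News  Today", "company news today", "Product List"};
        for (String page : pages) {
            System.out.println((dedup.isNew(page) ? "store: " : "skip:  ") + page);
        }
    }
}
```

In a pipeline like the one described, a check of this kind would sit between information analysis and information storage, so that only pages with previously unseen content are stored. Exact hashing catches only pages that are identical after normalization; detecting near-duplicates would need a different technique.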
Keywords/Search Tags: Java, Heritrix, Tika, information extraction, web crawlers