Research On Internet Real-Time Information Acquiring And Indexing Technology

Posted on:2016-02-11

Degree:Master

Type:Thesis

Country:China

Candidate:J W Zhang

Full Text:PDF

GTID:2298330467991824

Subject:Communication and Information System

Abstract/Summary:

With the rapid development of internet technology, the amount of information on the internet keeps growing. Although users can retrieve information through a search engine, the search results are often out of date. Because they pursue broad coverage, traditional search engines spend a great deal of time gathering information from the internet, and indexing that information also takes a long time. As a result, the original content on the internet may have changed substantially by the time the user receives the search results.

This thesis studies technology for acquiring and indexing real-time information on the internet. The main research contents are as follows.

First, based on a study of the architecture of traditional search engines, this thesis analyzes their disadvantages and proposes a plan for improving the web crawler and the indexing process.

Second, this thesis introduces real-time information on the internet, especially news reports, and proposes a scheme for acquiring news reports from the internet. It then builds a web crawler system based on the open source software Heritrix to show that the scheme is feasible.

Third, this thesis studies indexing technology and conducts experiments based on the open source software Lucene to optimize the indexing process. In addition, by optimizing the ranking rule for search results, the system places emphasis on the time factor.

Finally, this thesis builds a real-time search engine on the J2EE platform by integrating the web crawler system and the indexing system. More importantly, it proposes two strategies for gathering real-time information: a scheduling strategy based on time and a scheduling strategy based on the behavior of users. The former runs the web crawler automatically at short intervals; the latter monitors the search behavior of users to detect hot-spot events and then runs the web crawler to acquire real-time information just in time. Experiments are then conducted to show that the strategies are feasible and the system is implementable.
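
The scheduling strategy based on the behavior of users can be illustrated with a minimal sketch. The abstract does not say how hot-spot events are detected, so the sliding-window count below is an assumption, and the class name `HotQueryDetector`, the method `record`, and the parameters `threshold` and `windowMillis` are hypothetical names chosen for illustration. The sketch uses only the Java standard library and does not reproduce the thesis's Heritrix or Lucene integration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not the thesis's implementation): a query is treated
 * as a "hot spot" when it has been searched at least `threshold` times
 * within the last `windowMillis` milliseconds.
 */
final class HotQueryDetector {
    private final int threshold;
    private final long windowMillis;
    // Per-query timestamps of recent searches, oldest first.
    private final Map<String, Deque<Long>> recentHits = new HashMap<>();

    HotQueryDetector(int threshold, long windowMillis) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
    }

    /**
     * Records one search for `query` at time `nowMillis`.
     * Returns true if the query is now hot, i.e. a crawl should be triggered.
     */
    boolean record(String query, long nowMillis) {
        Deque<Long> times = recentHits.computeIfAbsent(query, k -> new ArrayDeque<>());
        times.addLast(nowMillis);
        // Discard searches that have fallen out of the sliding window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() >= windowMillis) {
            times.removeFirst();
        }
        return times.size() >= threshold;
    }
}
```

In such a design, the search front end would call `record` for every incoming query and start a focused crawl whenever it returns true. The time-based strategy would instead run the crawler from a fixed-interval timer, independent of user queries.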

Keywords/Search Tags:

search engine, real-time information, web crawler, indexing

Related items

1	Detection And Simple Use Of Time Information In Real-time Search Engine
2	The Research And Development Of Distributed Real-time Vertical Search Engine
3	The Research And Application Of Real-Time Search Engine For Large-Scale Enterprise System
4	Crawl Schedule Research For Real-time Vertical Search Engine
5	Crawl Technology Research For Real-time Vertical Search Engine
6	Research And Implementation Of A Time-based Vertical Search Engine
7	Research On The Development Of Search Engine From The Perspective Of Communication
8	Research And Implementation Of Chinese Meta Search Engine Based On Web
9	Design And Realization Of News Search Engine Based On Java
10	Research On Developing Trend And Strategy Of China's Search Engine Market