Research On Internet Real-Time Information Acquiring And Indexing Technology

Posted on: 2016-02-11 | Degree: Master | Type: Thesis
Country: China | Candidate: J W Zhang | Full Text: PDF
GTID: 2298330467991824 | Subject: Communication and Information System
Abstract/Summary:
With the rapid development of internet technology, the amount of information on the internet keeps growing. Although users can find information through a search engine, the search results are often out of date. Because traditional search engines pursue broad coverage, they spend a great deal of time gathering information from the internet, and indexing that information also takes considerable time. As a result, the original content on the internet may have changed substantially by the time the user receives the search results.

This paper studies real-time information acquisition and indexing technology for the internet. The main research contents are as follows.

Firstly, based on a study of the architecture of traditional search engines, this paper analyzes their disadvantages and proposes a plan to improve the web crawler and the indexing process.

Secondly, this paper introduces real-time information on the internet, especially news reports, and proposes a scheme for acquiring news reports from the internet. It then builds a web crawler system based on the open source software Heritrix to show that the scheme is feasible.

Thirdly, this paper studies indexing technology and conducts experiments based on the open source software Lucene to optimize the indexing process. In addition, by optimizing the sorting rule for search results, the system places emphasis on the time factor.

Finally, this paper builds a real-time search engine on the J2EE platform by integrating the web crawler system and the indexing system. More importantly, it proposes two schemes for gathering real-time information: a scheduling strategy based on time and a scheduling strategy based on user behavior. The former runs the web crawler automatically at short intervals; the latter monitors users' search behavior to detect hot spot events and then runs the web crawler to acquire real-time information in time. This paper then conducts experiments to show that the schemes are feasible and that the system can be implemented.
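
As an illustration of two ideas in the abstract, the time-weighted sorting rule and the scheduling strategy based on user behavior, the following is a minimal Python sketch. It is not the thesis implementation, which builds on Heritrix, Lucene, and J2EE and whose formulas are not given here. The exponential decay, the sliding-window threshold, and all names (`recency_boosted_score`, `HotQueryDetector`, `half_life_seconds`, `window_seconds`, `threshold`) are assumptions made only for this example.

```python
from collections import defaultdict, deque


def recency_boosted_score(relevance, age_seconds, half_life_seconds=3600.0):
    """Scale a relevance score by an exponential decay on document age.

    Assumption: the score halves every `half_life_seconds`; the thesis
    does not state which time factor it uses.
    """
    decay = 0.5 ** (max(age_seconds, 0.0) / half_life_seconds)
    return relevance * decay


class HotQueryDetector:
    """Flag a query as a hot spot when it is seen `threshold` times
    within a sliding window of `window_seconds` (hypothetical rule)."""

    def __init__(self, window_seconds=60.0, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._seen = defaultdict(deque)  # query -> timestamps in the window

    def record(self, query, now):
        """Record one search; return True if a crawl should be triggered."""
        key = query.strip().lower()
        stamps = self._seen[key]
        stamps.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.threshold:
            stamps.clear()  # reset so that one burst triggers one crawl
            return True
        return False
```

In a sketch like this, the search front end would call `record` on every user query and start a focused crawl whenever it returns `True`, while the time-based strategy would simply rerun the crawler on a fixed short interval.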
Keywords/Search Tags: search engine, real-time information, web crawler, indexing