
Research And Implementation Of Information Acquisition System Based On Heritrix

Posted on: 2014-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: W H Zhong
Full Text: PDF
GTID: 2268330392462839
Subject: Software engineering
Abstract/Summary:
With the popularity and development of the Internet, the speed and breadth of information dissemination in society have expanded to an unprecedented degree. Governments face urgent problems: how to find network public opinion hotspots, how to stop the spread of online rumors quickly, and how to respond to public opinion.

The Public Opinion Monitoring System described in this paper is supported by a science and technology support project of the "Twelfth Five-Year Plan". The system targets the agricultural product domain and focuses on public opinion analysis reports and sentiment analysis. It is designed to help food safety supervision personnel. Developing and applying these features will not only significantly reduce their workload but also increase the response rate to network public opinion. This thesis concentrates on the information fetching function, researching and implementing it as one of the important subsystems of the Public Opinion Monitoring System.

The main work and contributions of this paper are as follows:

First, based on the open source project Heritrix, we developed a custom information crawling mode for the public opinion monitoring system and extended the original Heritrix processing chain. The extended processing chain contains three filtering patterns: more flexible domain name filtering, filtering of non-text requests, and keyword filtering. Together these greatly improve the efficiency of information capture.

In addition, we added a mechanism in which node webpages guide incremental acquisition during the crawling process, thereby achieving incremental crawling. We also integrated the laboratory's existing weibo function, so that the system captures information from three data sources: weibo, news, and forums.
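The three filtering patterns named above (domain name, non-text request, keyword) could be sketched roughly as follows. This is a minimal illustration in Python, not the thesis's Heritrix code and not the Heritrix API. The names `Candidate`, `domain_filter`, `non_text_filter`, `keyword_filter`, and `accept`, as well as the domain, extension, and keyword lists, are assumptions made up for the example.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical settings; none of these values come from the thesis.
ALLOWED_DOMAIN_SUFFIXES = ("sina.com.cn", "tianya.cn")
NON_TEXT_EXTENSIONS = (".jpg", ".png", ".gif", ".css", ".js", ".mp4", ".zip")
KEYWORDS = ("food safety", "pesticide")


@dataclass
class Candidate:
    """A crawl candidate: a URL plus the page text once it is known."""
    url: str
    text: str = ""


def domain_filter(c):
    # Accept the listed domains and any of their subdomains.
    host = urlparse(c.url).hostname or ""
    return any(host == s or host.endswith("." + s) for s in ALLOWED_DOMAIN_SUFFIXES)


def non_text_filter(c):
    # Reject requests whose path looks like an image, script, or archive.
    path = urlparse(c.url).path.lower()
    return not path.endswith(NON_TEXT_EXTENSIONS)


def keyword_filter(c):
    # Keep only pages whose text mentions at least one keyword.
    text = c.text.lower()
    return any(k in text for k in KEYWORDS)


# The filters run in order; a candidate is kept only if all of them pass.
FILTER_CHAIN = (domain_filter, non_text_filter, keyword_filter)


def accept(c):
    return all(f(c) for f in FILTER_CHAIN)
```

In this sketch the cheap URL-based checks run first, so a page is only examined for keywords if its domain and file type already qualify. Whether the thesis orders its filters this way is not stated in the abstract.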
Then, on the basis of the captured information, we use XML file templates that describe page structure to extract the unstructured text of news and forum pages and to perform a basic organization of the captured information, providing the data for the next step of text analysis.

Finally, while implementing the functions of the system, we carried out a large number of capture experiments. We tested each filtering chain and the core incremental crawling modules, and analyzed the experimental data.

The Information Acquisition System combines network public opinion with the work of food safety supervision, which is a relatively innovative approach to regulation. In addition, this paper puts forward the concept of the node webpage and feeds information extraction results back into the crawl process.

At present, the system has successfully acquired over one hundred thousand data records from many large sites, including Sina, Tianya, and the Jike search engine, among others.
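As a rough illustration of template-driven extraction, the sketch below reads a small XML template and uses it to pull fields out of a page. The template schema (`<template>`, `<field>` with `name`, `start`, and `end` attributes) and the function names `load_template` and `extract` are assumptions for this example; the abstract does not specify the format the thesis actually uses.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical template format; the thesis does not specify its XML schema.
TEMPLATE_XML = """
<template site="example-news">
  <field name="title" start="&lt;h1&gt;" end="&lt;/h1&gt;"/>
  <field name="body" start="&lt;div id=&quot;content&quot;&gt;" end="&lt;/div&gt;"/>
</template>
"""


def load_template(xml_text):
    """Return a list of (field name, start marker, end marker) tuples."""
    root = ET.fromstring(xml_text.strip())
    return [(f.get("name"), f.get("start"), f.get("end")) for f in root.findall("field")]


def extract(html, fields):
    """Extract the text between each field's start and end markers."""
    record = {}
    for name, start, end in fields:
        pattern = re.escape(start) + r"(.*?)" + re.escape(end)
        match = re.search(pattern, html, re.DOTALL)
        if match:
            # Drop any nested tags and collapse whitespace.
            text = re.sub(r"<[^>]+>", " ", match.group(1))
            record[name] = " ".join(text.split())
    return record
```

A per-site template like this keeps the extraction rules out of the code, so supporting a new news or forum site would mean adding a template rather than changing the extractor. Marker-based matching is only one possible design; it is used here because it is short enough to show in full.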
Keywords/Search Tags: Public Opinion Monitoring, Information Acquisition, Information Extraction, Heritrix, Unstructured Information