The Study On Technology Of Information Collection Based On Web Crawler

Posted on: 2019-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Wang
GTID: 2428330545981664
Subject: Electronic and communication engineering
Abstract/Summary:
In the information age, the Internet plays an increasingly important role and has become an integral part of everyday life. It is a major platform for disseminating and sharing information. However, the information resources on the Internet are massive, dynamic, disordered, and of uneven quality, and they lack unified organization and control, which makes searching for and obtaining information inconvenient. Retrieving the information a user needs accurately and quickly from this ocean of data remains a major problem, so collecting and classifying information from large Internet platforms has become an active research topic.

Self-media (user-generated media) has recently become a trend in the Internet industry, and self-media platforms are now an important channel through which people obtain information online. This thesis proposes using web crawler technology to collect content from the Sohu News self-media platform, and then applying information extraction and information retrieval techniques to help users locate the information they need accurately.

The thesis first introduces the background and significance of the research, along with the state of research and development trends in China and abroad. Based on practical requirements, the open-source crawler framework Heritrix is used to collect information from the Sohu News self-media platform. Htmlparser extracts the information embedded in the tags of the crawled web pages, and the extracted information is stored in a local database. Lucene is used to index the stored data and provide information retrieval, and the classic SSH2 web framework presents the results through a web interface where users can browse and search.
Keywords/Search Tags: Heritrix, Information Collection, Information Extraction, Lucene, Htmlparser
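
The extract-then-index flow described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: it uses only the Python standard library as a stand-in for Htmlparser (extraction) and Lucene (indexing and retrieval), and it leaves out crawling and database storage. The class and function names (`ArticleExtractor`, `extract`, `InvertedIndex`) and the assumed page layout (a `<title>` plus `<p>` paragraphs) are hypothetical. The tokenizer handles only ASCII letters and digits, so real Sohu News pages in Chinese would need a proper word segmenter or analyzer.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects text inside <title> and <p> tags (assumed page layout)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag currently being captured, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def extract(html):
    """Turn one crawled page into a structured record."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    body = " ".join(p.strip() for p in parser.paragraphs)
    return {"title": parser.title.strip(), "body": body}


def tokenize(text):
    # ASCII-only tokenizer; a simplification for this sketch.
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, record):
        self.docs[doc_id] = record
        for term in tokenize(record["title"] + " " + record["body"]):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)
```

In use, each crawled page would be passed through `extract`, added to the index under a document id, and then queried with `search`. A production system built on Lucene would additionally rank results by relevance rather than return a plain list of matching ids.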