The Study On Technology Of Information Collection Based On Web Crawler

Posted on:2019-12-20

Degree:Master

Type:Thesis

Country:China

Candidate:Z H Wang

GTID:2428330545981664

Subject:Electronic and communication engineering

Abstract/Summary:

In the information age, the Internet plays an increasingly important role and has become an integral part of everyday life. It is an important platform for sharing and disseminating information. However, information resources on the Internet are massive, dynamic, disorderly, and of uneven quality, and they lack unified organization and control. This makes searching for and obtaining information inconvenient. Retrieving the information a user needs accurately and quickly from this ocean of data remains a major problem. Collecting information from large Internet platforms and classifying it has therefore become an active research topic.

Self-media has recently become a new trend in the Internet industry, and self-media platforms have become an important channel through which people obtain information online. This thesis proposes using web crawler technology to collect content from the Sohu News self-media platform, and then applying information extraction and information retrieval techniques to help users locate the information they need accurately.

The thesis first introduces the background and significance of the research, along with the research status and development trends in China and abroad. Based on practical requirements, the open-source crawler framework Heritrix is used to collect information from the Sohu News self-media platform. Htmlparser is used to extract the information embedded in the HTML tags of the crawled pages, and the extracted information is stored in a local database system. Lucene is used to index the data in the database and to implement information retrieval. Combined with the classic SSH2 web framework, the system presents results through a web page interface where users can browse and search.
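
The extract, store, index, and search stages described above can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes Python's standard library as a stand-in for each component (`html.parser` in place of Htmlparser, an in-memory SQLite database in place of the local database system, and a plain dictionary-based inverted index in place of Lucene). The crawling stage (Heritrix) is omitted, and `SAMPLE_PAGES`, its URLs, and all function names are hypothetical.

```python
import re
import sqlite3
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical stand-in for pages a crawler would have fetched.
SAMPLE_PAGES = {
    "http://example.com/a": (
        "<html><head><title>Crawler basics</title></head>"
        "<body><p>A web crawler collects pages.</p></body></html>"
    ),
    "http://example.com/b": (
        "<html><head><title>Index basics</title></head>"
        "<body><p>An inverted index maps terms to pages.</p>"
        "<p>Search uses the index.</p></body></html>"
    ),
}


class ArticleParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag whose text is currently being captured
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def extract(html):
    """Extraction stage: return (title, body text) for one page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    body = " ".join(p.strip() for p in parser.paragraphs if p.strip())
    return parser.title.strip(), body


def store(conn, pages):
    """Storage stage: save the extracted fields of each page in a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article "
        "(url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    for url, html in pages.items():
        title, body = extract(html)
        conn.execute(
            "INSERT OR REPLACE INTO article VALUES (?, ?, ?)",
            (url, title, body),
        )
    conn.commit()


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(conn):
    """Indexing stage: map each term to the set of URLs containing it."""
    index = defaultdict(set)
    rows = conn.execute("SELECT url, title, body FROM article")
    for url, title, body in rows:
        for term in tokenize(title + " " + body):
            index[term].add(url)
    return index


def search(index, query):
    """Retrieval stage: return sorted URLs that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)
```

To try it, open a connection with `sqlite3.connect(":memory:")`, call `store(conn, SAMPLE_PAGES)`, build the index with `build_index(conn)`, and query it with `search(index, "inverted index")`, which matches only the second sample page. A real Lucene index would add analysis, scoring, and ranking, none of which this sketch attempts.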

Keywords/Search Tags:

Heritrix, Information Collection, Information Extraction, Lucene, Htmlparser

Related items

1	Design And Implementation Of A Job Vertical Search Engine Based On Lucene And Heritrix
2	The Research And Implementation Of Torrent Information Aggregation And Extraction Model Based On RSS
3	Vertical Search Engine For Mobile Phone Information
4	The Study On Technology Of Website Information Collection Based On Web Crawler
5	Research And Implementation Of The Information Extraction In Retrieval System-Based Heritrix
6	Research And Implementation Of The Vertical Search Engine On Lucene
7	Design And Implementation Of Vertical News Search Engine Based On Heritrix
8	Research Heritrix And Vertical Search Engine Based On Lucene
9	Research On The Collection And Management Of Public Opinion In BBSes
10	Design And Implementation Of Digital Steganography Image Acquisition System Based On Web Crawler