The Study On Technology Of Website Information Collection Based On Web Crawler

Posted on:2015-06-30

Degree:Master

Type:Thesis

Country:China

Candidate:J X Sun

Full Text:PDF

GTID:2298330467950768

Subject:Management Science and Engineering

Abstract/Summary:

With the rapid development of the Internet, it has gradually become integrated into every aspect of people's daily lives. The Web has become an important way for people to communicate with each other and to obtain information from the outside world. As a valuable information source, the Web can deliver many kinds of content (text, audio, video, etc.) through its rich expressive power and intuitive interface. Over time, both the scale of the Internet and the size of its user population have grown quickly. Users' demands are becoming more diverse, and how to provide users with the information they are interested in has become a problem.

We Media has gradually begun to rise on the Internet, and its scale keeps growing. It has attracted more and more attention, since it includes many outstanding representatives. This thesis therefore uses technical means to collect information from Baidu Baijia (a We Media platform), and the collected articles are then reorganized for secondary use. To achieve this goal, the thesis proposes an integration scheme based on a web crawler and other technologies.

The proposed integration scheme has three parts: information collection, information extraction, and information retrieval. Information collection, based on Heritrix, is responsible for collecting web pages. Information extraction, based on Jsoup and DOM, is responsible for extracting information from the pages and saving it in a database, transforming unstructured information into structured information. Information retrieval, based on Lucene and SSH2, is responsible for displaying the collected articles.
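The extraction and retrieval stages described above can be illustrated with a minimal sketch. It is not the thesis's implementation: Python's standard-library `html.parser` stands in for Jsoup, a plain in-memory inverted index stands in for Lucene, and the collection stage (Heritrix) is replaced by two hard-coded sample pages. The class name `ArticleExtractor`, the functions `extract`, `build_index`, and `search`, and the assumption that a title sits in `<h1>` and body text in `<p>` are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class ArticleExtractor(HTMLParser):
    """Pulls the <h1> title and <p> body text out of one article page."""

    def __init__(self):
        super().__init__()
        self._tag = None       # tag currently being read, if it is one we keep
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "h1":
            self.title += data
        elif self._tag == "p":
            self.paragraphs[-1] += data


def extract(html):
    """Extraction stage: unstructured HTML -> structured record."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "body": " ".join(p.strip() for p in parser.paragraphs),
    }


def build_index(records):
    """Maps each lowercase word to the set of record ids that contain it."""
    index = defaultdict(set)
    for doc_id, record in enumerate(records):
        text = (record["title"] + " " + record["body"]).lower()
        for word in re.findall(r"\w+", text):
            index[word].add(doc_id)
    return index


def search(index, records, term):
    """Retrieval stage: titles of the records that contain the term."""
    return [records[i]["title"] for i in sorted(index.get(term.lower(), ()))]


# Hypothetical pages standing in for what the crawler would have fetched.
pages = [
    "<html><body><h1>Crawler basics</h1><p>A crawler fetches pages.</p></body></html>",
    "<html><body><h1>Search basics</h1><p>An index maps words to pages.</p></body></html>",
]
records = [extract(page) for page in pages]
index = build_index(records)
```

Calling `search(index, records, "crawler")` looks up the term in the index and returns the titles of the matching records. In a fuller system the records would be written to a database between the two stages, as the abstract describes; the sketch keeps them in a list to stay self-contained.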

Keywords/Search Tags:

Information Collection, Information Extraction, Crawler, Heritrix

Related items

1	The Study On Technology Of Information Collection Based On Web Crawler
2	Design And Implementation Of Digital Steganography Image Acquisition System Based On Web Crawler
3	Research And Implementation Of The Information Extraction In Retrieval System-Based Heritrix
4	A Web Crawler System For Professional-town Information Based On Heritrix Framework
5	Design And Implementation Of Vertical News Search Engine Based On Heritrix
6	Designing And Implementation Of Information Collection And Classification System Based On Web Crawler
7	Design And Implementation Of Internet Tax Information Collection System Based On Web Crawler
8	Research Of Internet Information Collection System Based On Cloud Platform Web Crawler
9	Research And Implementation Of Information Acquisition System Based On Heritrix
10	Based On Templated Web Crawler Technology Of Web Page Information Extraction