The Study On Technology Of Website Information Collection Based On Web Crawler

Posted on: 2015-06-30
Degree: Master
Type: Thesis
Country: China
Candidate: J X Sun
GTID: 2298330467950768
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, it has gradually become integrated into all aspects of people's daily lives. The Web has become an important way for people to communicate with each other and to obtain outside information. As a valuable source of information, the Web provides many kinds of content (text, audio, video, etc.) through its rich expressive ability and intuitive use. Over time, both the scale of the Internet and the size of its user population have grown quickly. Users' demands are becoming more diverse, and how to provide users with the information they are interested in has become a problem.

We Media has gradually begun to rise on the Internet, and its scale keeps growing. It has attracted more and more attention because it includes many outstanding representatives. This thesis therefore uses technical means to collect information from Baidu Baijia (a We Media platform) and then reorganizes the collected articles for secondary use. To achieve this goal, the thesis proposes an integration scheme based on a web crawler and other technologies.

The proposed integration scheme has three parts: information collection, information extraction, and information retrieval. Information collection gathers web pages using Heritrix. Information extraction, based on Jsoup and DOM, extracts information from the pages and saves it in a database, transforming unstructured information into structured information. Information retrieval, based on Lucene and SSH2, displays the collected articles.
Keywords/Search Tags: Information Collection, Information Extraction, Crawler, Heritrix
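
The extraction and retrieval stages described in the abstract can be illustrated with a small, self-contained Java sketch. It is not the thesis implementation: to stay runnable with the standard library alone, a regular expression stands in for Jsoup, an in-memory list stands in for the database, and a simple inverted index stands in for Lucene. The collection stage (Heritrix) is assumed to have already produced the HTML strings. All names here (PipelineSketch, Article, extract, buildIndex, search) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipelineSketch {

    /** Structured form of one collected page (hypothetical schema, not from the thesis). */
    record Article(String title, String body) {}

    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern BODY =
            Pattern.compile("<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Extraction stage: a regex stand-in for a real HTML parser such as Jsoup.
    // Turns unstructured HTML into a structured Article.
    static Article extract(String html) {
        Matcher t = TITLE.matcher(html);
        String title = t.find() ? t.group(1).trim() : "";
        Matcher b = BODY.matcher(html);
        String rawBody = b.find() ? b.group(1) : "";
        // Replace tags with spaces, then collapse runs of whitespace.
        String body = rawBody.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
        return new Article(title, body);
    }

    // Retrieval stage: a tiny inverted index (term -> positions in the article list),
    // standing in for a Lucene index.
    static Map<String, Set<Integer>> buildIndex(List<Article> articles) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int i = 0; i < articles.size(); i++) {
            Article a = articles.get(i);
            String text = (a.title() + " " + a.body()).toLowerCase(Locale.ROOT);
            for (String term : text.split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, k -> new TreeSet<>()).add(i);
                }
            }
        }
        return index;
    }

    // Looks up a single term (case-insensitive) and returns matching article positions.
    static List<Integer> search(Map<String, Set<Integer>> index, String term) {
        return new ArrayList<>(index.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of()));
    }

    public static void main(String[] args) {
        // Collection stage is assumed done: these strings stand in for crawled pages.
        List<String> pages = List.of(
            "<html><head><title>Crawler basics</title></head>"
                + "<body><p>Heritrix collects pages.</p></body></html>",
            "<html><head><title>Search</title></head>"
                + "<body><p>Lucene indexes pages.</p></body></html>");
        List<Article> store = new ArrayList<>(); // stands in for the database
        for (String html : pages) {
            store.add(extract(html));
        }
        Map<String, Set<Integer>> index = buildIndex(store);
        for (int id : search(index, "pages")) {
            System.out.println(store.get(id).title());
        }
    }
}
```

Running main prints the titles of both sample pages, since each contains the term "pages". A real system would replace the regex with a proper HTML parser and the map with a full-text index, because regular expressions cannot handle malformed or nested markup reliably.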