
Web Mining Research And Implementation Of Information Technology

Posted on: 2011-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: H C He
Full Text: PDF
GTID: 2208360302989780
Subject: Management Science and Engineering
Abstract/Summary:
The World Wide Web has become the world's largest public data source, yet its information resources remain difficult to use effectively. Most Web information exists as HTML documents, whose characteristics mean they cannot serve directly as a data source for popular data mining software. How to collect Web information effectively is therefore a key problem for Web mining.

This thesis studies the collection of information from the Web into a structured database. The collection has three stages: Web crawling, page cleaning, and information extraction. Web crawling uses a computer program to automatically download Web pages of similar structure to the local machine. Page cleaning removes invalid content from Web pages. Information extraction defines extraction rules, uses them to distill useful information from a Web page, and stores that information in a structured database.

We implement a program called MyCrawler to download Web pages and describe the details of its implementation, including HTTP parsing, URL extraction, page storage, and URL filtering, along with key techniques such as performance optimization and form validation. Based on the law of Web page similarity, we use URLs to guide MyCrawler to download pages related to the user's interests. To purify a page, we use HTML container tags to divide it into several content blocks and use text density to decide whether each block is useful or useless. For information extraction, we parse Web pages into a DOM tree and use XPath rules to extract structured data from HTML/XML data sources. We also implement an information extraction platform that makes it easy to generate extraction rules.

Finally, we carry out an information collection experiment (gathering information from a recruitment website) and achieve good results.
Keywords/Search Tags:web crawler, web page purification, web information extraction
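
The page-cleaning and extraction stages described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the sample page, the set of container tags, the 0.4 density threshold, the definition of text density as visible text length divided by serialized block length, the XPath rules, and the `jobs` table are all assumptions made for illustration. The page is also assumed to be well-formed so that the standard-library XML parser can read it; real HTML would need a more tolerant parser.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical recruitment page, assumed well-formed for the stdlib parser.
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/jobs">Jobs</a><a href="/about">About</a></div>
<div id="job"><h1>Data Engineer</h1><p class="company">Example Corp</p><p class="city">Shanghai</p><p>Builds and maintains the data pipelines used for reporting and analysis.</p></div>
</body></html>"""

CONTAINER_TAGS = {"div", "table", "ul"}  # assumed set of container tags
DENSITY_THRESHOLD = 0.4                  # assumed cut-off, not from the thesis

# Assumed extraction rules: field name -> XPath relative to a content block.
RULES = {
    "title": "./h1",
    "company": "./p[@class='company']",
    "city": "./p[@class='city']",
}


def text_density(block):
    """Share of a block's serialized length that is visible text (assumed metric)."""
    text_len = len("".join(block.itertext()).strip())
    total_len = len(ET.tostring(block, encoding="unicode"))
    return text_len / total_len if total_len else 0.0


def useful_blocks(page):
    """Split the page body into container blocks and keep the text-dense ones."""
    body = ET.fromstring(page).find("body")
    return [
        el for el in body
        if el.tag in CONTAINER_TAGS and text_density(el) >= DENSITY_THRESHOLD
    ]


def extract(block, rules):
    """Apply each XPath rule to a block and collect the matched text."""
    record = {}
    for field, xpath in rules.items():
        node = block.find(xpath)
        record[field] = node.text.strip() if node is not None and node.text else None
    return record


def store(records, conn):
    """Write extracted records into a structured (SQLite) table."""
    conn.execute("CREATE TABLE IF NOT EXISTS jobs (title TEXT, company TEXT, city TEXT)")
    conn.executemany("INSERT INTO jobs VALUES (:title, :company, :city)", records)
    conn.commit()


records = [extract(block, RULES) for block in useful_blocks(PAGE)]
conn = sqlite3.connect(":memory:")
store(records, conn)
rows = conn.execute("SELECT title, company, city FROM jobs").fetchall()
```

In this sketch the link-heavy navigation block falls below the threshold and is discarded, while the job description block passes and is the only one handed to the XPath rules. Keeping the rules in a plain mapping reflects the idea of rules that can be generated and swapped without changing the extraction code.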