Web Mining Research And Implementation Of Information Technology

Posted on:2011-07-14

Degree:Master

Type:Thesis

Country:China

Candidate:H C He

Full Text:PDF

GTID:2208360302989780

Subject:Management Science and Engineering

Abstract/Summary:

The World Wide Web has become the world's largest public data source, but its information resources are difficult to use effectively. Most Web information takes the form of HTML documents, whose characteristics mean they cannot serve directly as a data source for popular data mining software. How to collect Web information effectively is therefore a central problem for Web mining.

This thesis studies the collection of information from the Web into a structured database. Collection involves three processes: Web crawling, page cleaning, and information extraction. Web crawling uses a computer program to automatically download Web pages of similar structure to the local machine. Page cleaning removes invalid content from Web pages. Information extraction defines extraction rules, uses them to distill useful information from a Web page, and stores that information in a structured database.

We implement a program called MyCrawler to download Web pages and elaborate on the details of its implementation, such as HTTP parsing, URL extraction, page storage, and URL filtering, along with key technologies such as performance optimization and form validation. Based on the law of Web page similarity, we use URLs to guide MyCrawler to download pages related to the user's interests. To purify a page, we use HTML container tags to divide it into several content blocks and use text density to identify whether each block is useful or useless. For information extraction, we parse Web pages into a DOM tree and use XPath rules to extract structured data from HTML/XML data sources. We implement an information extraction platform on which extraction rules can be generated easily.

Finally, we carry out an information collection experiment (gathering information from a recruitment website) and achieve good results.
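
The page purification step described above (splitting a page at container tags and scoring each block by text density) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the set of container tags, the density measure (visible characters per tag), the threshold of 10, and the names BlockSplitter, text_density and useful_blocks are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed set of container tags; the abstract does not list the ones actually used.
CONTAINER_TAGS = {"div", "table", "td", "ul"}


class BlockSplitter(HTMLParser):
    """Split a page into content blocks at container tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open blocks, innermost last
        self.blocks = []  # finished blocks, in the order they were closed

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            # A container opens a new block; count the container tag itself.
            self.stack.append({"tag": tag, "text": [], "tags": 1})
        elif self.stack:
            # Any other tag is charged to the innermost open block.
            self.stack[-1]["tags"] += 1

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.stack[-1]["text"].append(text)


def text_density(block):
    """Hypothetical density measure: visible characters per tag in the block."""
    return len(" ".join(block["text"])) / block["tags"]


def useful_blocks(html, threshold=10.0):
    """Return the text of every block whose density reaches the assumed threshold."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return [" ".join(b["text"]) for b in splitter.blocks
            if text_density(b) >= threshold]


if __name__ == "__main__":
    page = (
        '<div id="nav"><a href="/">Home</a><a href="/jobs">Jobs</a>'
        '<a href="/about">About</a></div>'
        '<div id="main"><p>Software engineer wanted. Three years of Java '
        'experience required, based in Shanghai.</p></div>'
    )
    for text in useful_blocks(page):
        print(text)
```

In this sketch a link-heavy navigation block scores low because its few characters are spread over many tags, while a paragraph of body text scores high. A real page would also need script and style content excluded and unclosed tags handled, which the sketch omits.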

Keywords/Search Tags:

web crawler, web page purification, web information extraction

Related items

1	Research Of Main Technologies Of Vertical Search Engine
2	Research On Web Page Classification And Information Collection
3	Based On Templated Web Crawler Technology Of Web Page Information Extraction
4	Design And Implementation Of Web Crawler For Given Page
5	The Design And Implementation Of Distributed Web Crawler System Based On Automatic Extraction Of Webpage Information
6	Research On Entity-level Search Crawler And Information Extraction
7	Research Of Web Page Purification And Replicas Detection In Search Engine
8	Vertical Search Engine For Crawling The Web Page Design And Implementation
9	Research On APK Crawler With Automatic Pagination Detection And Search Results Extraction
10	Research And Implementation Of Chinese Web-page Classification Based On Web Data-mining