
Research And Implementation On Web Quality Information Extraction And Management

Posted on: 2011-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: H C Cui
Full Text: PDF
GTID: 2178360305961306
Subject: Computer application technology
Abstract/Summary:
As the web becomes the world's largest and most comprehensive information repository, business intelligence, which centers on enterprise operations and analytical decision-making, has advanced to a new level. The web is a vital data source for business intelligence systems, but web data suffers from many quality problems, and obtaining the needed information from the Internet quickly and efficiently has become an urgent problem. Web resource quality mining has emerged in response. To address the data source problem of web resource quality mining, this thesis extracts and integrates web quality evaluation data and builds a data warehouse of web resource quality. This lays the groundwork for future work on OLAP analysis of web quality data cubes and on outlier detection for web resource quality.

First, existing data extraction technologies were analyzed, and a web data extraction system based on HTML structure was designed and implemented. The system consists of four modules: page preprocessing, web page clustering, rule generation, and data extraction. In the preprocessing module, the HtmlCleaner tool cleans pages, which are then converted to XML and parsed into DOM trees. In the clustering module, page similarity is calculated by the web STM clustering algorithm, and pages are clustered according to the distance between their tag trees. In the rule generation module, XPath is used to locate the data region, and appropriate rules are generated for each cluster through inductive learning. In the extraction module, page content is extracted using the generated rules. Experimental results showed the system to be practical and effective.

Finally, the extracted web resource quality evaluation data was managed with SQL Server 2005 BI: a multi-dimensional data model was designed, and the web resource quality data warehouse was built and deployed.
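
The cluster-then-extract flow described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the pages have already been cleaned into well-formed XML (the role HtmlCleaner plays in the thesis), it substitutes a simple Jaccard similarity over tag paths for the STM-based tree distance, and it uses a hand-written XPath rule in place of rules produced by inductive learning. The page contents, the function names `tag_paths`, `similarity`, `cluster`, and `extract`, the `RULE` expression, and the 0.8 threshold are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-cleaned pages (well-formed XML).
PAGES = {
    "a.html": "<html><body><div><h1>Site A</h1>"
              "<span class='score'>0.91</span></div></body></html>",
    "b.html": "<html><body><div><h1>Site B</h1>"
              "<span class='score'>0.47</span></div></body></html>",
    "c.html": "<html><body><table><tr><td>Other layout</td></tr>"
              "</table></body></html>",
}

def tag_paths(root):
    """Collect the set of root-to-node tag paths in a DOM tree."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths

def similarity(a, b):
    """Jaccard similarity of tag-path sets (a stand-in for tree matching)."""
    pa, pb = tag_paths(a), tag_paths(b)
    return len(pa & pb) / len(pa | pb)

def cluster(trees, threshold=0.8):
    """Greedy clustering: join the first cluster whose first page is similar enough."""
    clusters = []
    for name, root in trees.items():
        for members in clusters:
            if similarity(trees[members[0]], root) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def extract(root, rule):
    """Apply an XPath rule and return the stripped text of each match."""
    return [(el.text or "").strip() for el in root.findall(rule)]

# Parse each page into a DOM tree.
trees = {name: ET.fromstring(markup) for name, markup in PAGES.items()}

# Hand-written rule for the cluster that contains a.html and b.html.
RULE = ".//span[@class='score']"

if __name__ == "__main__":
    for members in cluster(trees):
        print(members, [extract(trees[m], RULE) for m in members])
```

In this sketch, pages with the same template share every tag path and fall into one cluster, so a single XPath rule per cluster is enough to pull the same field from each member page.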
Keywords/Search Tags: Web quality mining, Web information extraction, DOM, Web quality data warehouse