
Research And Implementation Of Web Data Storage And Data Cleaning Technology Based On XML

Posted on: 2009-11-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y Qiu
GTID: 2178360245454995
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has greatly affected people's lifestyles and business models. Web resources are so vast and inexpensive that a growing number of corporations, institutions, and organizations expect to mine valuable commercial information from them to support their decision-making. However, data mining and data warehouses usually draw on structured sources such as relational databases. Given the urgency of data requirements and the inconsistent definitions of data formats, it is particularly important to convert Web information into relational database records and then to process the converted data into high-quality data for decision-making. This dissertation studies Web data storage and data cleaning technology in depth, and uses data cleaning to solve the data redundancy that duplicated Web information introduces during conversion. The main contributions of the dissertation are as follows:

1. After introducing XML and analyzing the display characteristics of Web information, the dissertation discusses the advantages of XML-based conversion from Web information to a relational database. Based on a study of the mutual mapping rules, it constructs a model framework that transforms Web information into relational database records and applies data cleaning to remove duplicate records from the database. To validate the validity and practicability of the framework, it is applied to the storage and cleaning of didactic knowledge from Web pages.

2. The dissertation studies in depth the field-matching algorithms used to find approximately duplicate records, such as the basic field-matching algorithm, the Smith-Waterman (S-W) algorithm, and the field-matching algorithm based on edit distance. After analyzing the shortcomings of these algorithms when applied to Chinese fields, together with the characteristics of duplicate fields, it proposes an improved scheme based on field keywords, which better meets the recall and precision requirements for duplicate record detection.

3. Because the improved field-matching algorithm relies on field keywords, the dissertation studies keyword extraction technology. It examines in depth a keyword extraction algorithm that uses an automatic abstracting method based on an improved co-occurrence model, analyzes the identity of keywords, and proposes an improved algorithm based on the identity of keywords, whose feasibility is validated by experiment.

4. Before XML data is converted to relational database records, the keywords of the nodes in the XML document are extracted by combining the improved keyword extraction algorithm with DOM. The extracted keywords become child nodes of those nodes, laying the foundation for applying the improved keyword-based field-matching algorithm during data cleaning.
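As an illustration of the edit-distance field matching mentioned among the baseline algorithms, the following is a minimal sketch, not the dissertation's implementation and not its improved keyword-based scheme. The function names, the normalization by the longer field length, and the 0.8 threshold are all assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row DP table."""
    # Keep the shorter string in the inner loop to minimize memory.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the distance over the longer length.

    This normalization is an assumption made for the example.
    """
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def is_approx_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag two field values as approximate duplicates.

    The 0.8 threshold is a hypothetical default, not taken from the source.
    """
    return field_similarity(a, b) >= threshold
```

Because the comparison works character by character, it applies to Chinese strings as well, but it treats every character as equally informative, which is one plausible reason a keyword-based comparison could behave better on such fields.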
Keywords/Search Tags: Data storage, Data cleaning, Automatic abstracting, XML, Edit distance