Research And Implementation Of Web Data Storage And Data Cleaning Technology Based On XML

Posted on:2009-11-11

Degree:Master

Type:Thesis

Country:China

Candidate:Y Qiu

GTID:2178360245454995

Subject:Computer application technology

Abstract/Summary:

The rapid development of the Internet has greatly affected people's lifestyles and business models. Web resources are so vast and inexpensive that a growing number of corporations, institutions, and organizations hope to mine them for valuable commercial information to support their decision-making. However, data mining and data warehousing usually draw on structured sources such as relational databases. Given the urgency of data requirements and the inconsistent definitions of data formats on the Web, two tasks become particularly important: converting Web information into relational database records, and processing the converted data to obtain high-quality data for decision-making.

This dissertation studies Web data storage and data cleaning technology in depth, and uses data cleaning to solve the data redundancy that duplicated Web information introduces during conversion. The main contributions of the dissertation are as follows:

1. After introducing XML and analyzing how Web information is displayed, the dissertation discusses the advantages of XML-based conversion from Web information to a relational database. Based on a study of the mutual mapping rules, it constructs a model framework that transforms Web information into relational database records and applies data cleaning technology to remove duplicate records from the database. To validate the framework's validity and practicality, it is applied to the storage and cleaning of didactic knowledge from Web pages.

2. The dissertation studies in depth the field-matching algorithms used to find approximately duplicate records, such as the basic field-matching algorithm, the Smith-Waterman (S-W) algorithm, and the field-matching algorithm based on edit distance. After analyzing the shortcomings of these algorithms when applied to Chinese fields, together with the characteristics of duplicate fields, it proposes an improved scheme based on field keywords, which better meets the recall and precision requirements for duplicate record detection.

3. Because the improved field-matching algorithm relies on field keywords, the dissertation studies keyword extraction technology. It examines in depth a keyword extraction algorithm that uses an automatic abstracting method based on an improved co-occurrence model, analyzes the identity of keywords, and proposes an improved algorithm based on the identity of keywords, whose feasibility is validated by experiment.

4. Before XML data is converted into relational database records, the keywords of the nodes in the XML document are extracted by combining the improved keyword extraction algorithm with DOM. The extracted keywords become child nodes of those nodes, laying the foundation for applying the improved keyword-based field-matching algorithm during data cleaning.
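
As an illustration of the edit-distance-based field matching named in contribution 2, the following is a minimal Python sketch of the standard Levenshtein distance turned into a normalized field similarity. It is not the dissertation's improved keyword-based scheme, whose details the abstract does not give. The function names and the 0.8 threshold are assumptions chosen for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Map the edit distance to a similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty fields are treated as identical
    return 1.0 - edit_distance(a, b) / longest


def is_probable_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag two field values as approximate duplicates.
    The default threshold is a hypothetical value, not one from the thesis."""
    return field_similarity(a, b) >= threshold
```

Because Python strings are sequences of Unicode code points, the same functions operate on Chinese fields character by character. A purely character-level comparison of this kind is the sort of baseline that the abstract says falls short for Chinese fields, which is what motivates the keyword-based scheme.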

Keywords/Search Tags:

Data storage, Data cleaning, Automatic abstracting, XML, Edit distance

Related items

1	Research On Technologies Of Duplicate Record Data Cleaning In Big Data Environment
2	Research Of Methods Of Data Cleaning For Hotel Entity Based On Edit Distance And Conditional Functional Dependencies
3	Research On Sorted-neighborhood Method And Its Application In Chinese Data Cleaning
4	Storage Optimization And Tree Vertical Merging Algorithm Of Tai Tree Editing Distance Algorithm
5	Research On Related Algorithms For Chinese Repeated Record Cleaning
6	Improved Edit Distance Algorithm And Its Application In E-government
7	Web Information Extracting Based On Tree Edit Distance
8	Data Cleaning Algorithm And Applications
9	Research On Data Cleaning Based On Science And Technology Innovation Big Data Public Platform
10	Research On Automatic Web Information Extraction Technique