
Research And Application Of Data Cleaning And XML Technologies Based On Digital Newspaper

Posted on: 2010-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Lv
Full Text: PDF
GTID: 2178360278465688
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, many enterprises have accumulated large amounts of semi-structured data, digital newspapers among them. Over a long period, newspaper and magazine offices have accumulated vast amounts of data, including publishing documents, text files, picture files and all kinds of Web documents. These data are generally an objective record of society and have high historical value. Finding an appropriate way to describe and store digital newspaper data has therefore become a problem to be solved.

As a new standard for data representation and data exchange, XML has a unique description mechanism for unstructured information. Because it is structured and extensible, XML can easily describe the structure and content of newspapers and magazines, so it has become the best carrier and description format for digital newspaper data. However, converting newspaper data into XML can produce many errors and incomplete records, for a variety of reasons. Such erroneous data seriously damage the accuracy, completeness and objectivity of the information, so it is important to improve data quality through data cleaning.

Based on the structure of digital newspapers and the related issues, this paper mainly does the following work:

(1) The paper discusses storage technologies for digital newspaper data, including file system storage, relational database storage and native XML database storage. Based on the characteristics of digital newspapers, it studies how to design the storage model and how to build the index structure.

(2) After introducing XML technology in detail, the paper presents the hierarchical XML structure used to describe newspaper data. This structure supports both the storage of digital newspapers and the implementation of data cleaning operations. The paper also discusses a backup and compression method for digital newspapers.

(3) The paper discusses the data cleaning flow for digital newspaper data in detail, including overall assessment, standardization, matching and elimination of duplicates, and completion of missing data, and it explores the steps of each process.
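The cleaning flow named in (3) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the newspaper > issue > page > article hierarchy, the element and attribute names, the sample data, the duplicate rule (same normalized title on the same page) and the "unknown" placeholder for a missing author. It uses only Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchy (not from the thesis): newspaper > issue > page > article.
SAMPLE = """<newspaper name="Example Daily">
  <issue date="2009/03/01">
    <page number="1">
      <article><title>  City Opens New Library </title><author>Li Wei</author></article>
      <article><title>city opens new library</title><author>Li Wei</author></article>
      <article><title>Harvest Report</title></article>
    </page>
  </issue>
</newspaper>"""


def clean(root):
    """Apply three illustrative cleaning steps to the tree in place."""
    # Standardization: write issue dates with '-' separators instead of '/'.
    for issue in root.iter("issue"):
        issue.set("date", issue.get("date", "").replace("/", "-"))

    for page in root.iter("page"):
        seen = set()
        # Iterate over a copy so articles can be removed from the page safely.
        for article in list(page.findall("article")):
            title = article.find("title")
            if title is None:
                continue
            # Standardization: collapse runs of whitespace in the title.
            title.text = " ".join((title.text or "").split())

            # Matching and elimination of duplicates: assumed rule is
            # "same case-insensitive title on the same page".
            key = title.text.lower()
            if key in seen:
                page.remove(article)
                continue
            seen.add(key)

            # Completing missing data: add a placeholder author if absent.
            if article.find("author") is None:
                ET.SubElement(article, "author").text = "unknown"
    return root


root = clean(ET.fromstring(SAMPLE))
titles = [a.findtext("title") for a in root.iter("article")]
authors = [a.findtext("author") for a in root.iter("article")]
issue_date = root.find("issue").get("date")
print(titles, authors, issue_date)
```

A real pipeline would likely need fuzzier matching than exact title equality and a better source for missing values than a fixed placeholder; the sketch only shows where each step sits in a hierarchical XML document.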
Keywords/Search Tags: digital newspaper, XML, data cleaning, data quality, native XML database