Research On Topic-oriented Semi-structured Data Integration Methods

Posted on:2019-01-04

Degree:Master

Type:Thesis

Country:China

Candidate:H L Su

Full Text:PDF

GTID:2348330566964280

Subject:Computer Science and Technology

Abstract/Summary:

The spread of the Internet has changed the way people publish and access information, and almost all organizations and users now choose it to publish data. However, because the Internet offers a variety of publishing formats and imposes no uniform format requirement, the semi-structured data tables published by institutions and users tend to have inconsistent logical structures, which poses a great challenge to anyone who needs to collect information within the same domain. How to structure and standardize these semi-structured data tables therefore becomes a problem to be solved. This thesis studies that problem and proposes methods for normalization, attribute dependency extraction, and candidate key identification for topic-oriented semi-structured data tables. The main contributions are as follows:

(1) A semi-structured data table integration framework. An overall framework is proposed for integrating semi-structured data tables, and the whole data processing workflow is described. The concepts of normalized table, non-normalized table, cell, attribute reduction, and difference function are formally defined.

(2) A standardization method for non-normalized tables. Based on the table definitions, a method is proposed for converting non-normalized tables, which do not conform to first normal form (1NF), into 1NF tables. By comprehensively analyzing the header characteristics of non-normalized tables, a header-based table normalization method is proposed, which includes identifying non-normalized tables, structurally transforming non-normalized headers, and extracting the attribute dependencies in the header.

(3) A topic-oriented method for attribute dependency and candidate key identification. Inspired by attribute reduction algorithms for information systems in rough set theory, a method for identifying attribute dependencies and candidate keys based on difference-function attribute reduction is proposed. From the nested header structure of non-normalized tables, the concepts of kernel-like sets and non-candidate-key sets are introduced, and on this basis the difference-function-based attribute reduction algorithm is improved. The algorithm uses the incremental same-topic data sets together with the kernel-like sets, non-candidate-key sets, kernels, and difference functions of each table and the attributes of the topic to jointly compute the candidate keys. Finally, all attribute dependencies of the topic are obtained from the two-dimensional table data set, and experiments demonstrate the feasibility of the method.
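
To make the idea in contribution (3) concrete, the following is a minimal sketch of candidate key identification from pairwise attribute differences. It assumes that the "difference function" plays the role of the discernibility function from rough set theory, and it uses a plain brute-force search rather than the improved algorithm of the thesis (no kernel-like sets, non-candidate-key sets, or incremental data). The names `difference_sets` and `candidate_keys` and the sample table are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def difference_sets(rows, attrs):
    """For every pair of rows, collect the set of attributes on which they differ."""
    sets = set()
    for r1, r2 in combinations(rows, 2):
        sets.add(frozenset(a for a in attrs if r1[a] != r2[a]))
    return sets


def candidate_keys(rows, attrs):
    """Return all minimal attribute sets that distinguish every pair of rows.

    An attribute set distinguishes two rows if it shares at least one
    attribute with that pair's difference set.
    """
    diffs = difference_sets(rows, attrs)
    if frozenset() in diffs:
        return []  # two identical rows: no attribute set can act as a key
    keys = []
    # Enumerate by increasing size so that minimal sets are found first.
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            cand = frozenset(combo)
            if any(k <= cand for k in keys):
                continue  # superset of a known key, hence not minimal
            if all(cand & d for d in diffs):
                keys.append(cand)
    return keys


# Hypothetical sample table (illustration only, not data from the thesis).
rows = [
    {"id": 1, "name": "A", "city": "X"},
    {"id": 2, "name": "B", "city": "X"},
    {"id": 3, "name": "A", "city": "Y"},
]
keys = candidate_keys(rows, ["id", "name", "city"])
```

For this sample table the search yields two candidate keys: `{id}` on its own, and `{name, city}` together. The brute-force enumeration is exponential in the number of attributes, which is one reason a reduction-based approach such as the one described above is attractive for wider tables.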

Keywords/Search Tags:

Semi-structured tables, First normal form, Rough set, Difference function, Kernel-like

Related items

1	Rough Approaching Of Structured Rough Set Approximations
2	Research On Extreme Learning Machines Optimization Methods
3	Research And Application Of Extraction Method Of Semi-structured Text Information
4	Investigations On Normal Forms In Intermediate Logics
5	Research On Temporal Difference Algorithm Based On Kernel Function Approximation
6	Research On Large-scale Structured And Semi-structured Biodata Query Method
7	Research Of Database Normal Form Decomposition
8	Study On Data Dependency Theory In Temporal Database
9	Research Of Information Retrieval Based Semi-Structured Data
10	Application of hierarchy-structured decision tables in automated vehicle control algorithms