
Research On Topic-oriented Semi-structured Data Integration Methods

Posted on: 2019-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: H L Su
GTID: 2348330566964280
Subject: Computer Science and Technology
Abstract/Summary:
The spread of the Internet has changed how people publish and access information, and almost all organizations and users now publish data online. However, because the Internet supports many publishing formats and imposes no uniform format requirement, the semi-structured data tables published by institutions and users tend to have inconsistent logical structures. This poses a major challenge for users who need to collect information within the same domain. How to structure and standardize these semi-structured data tables is therefore a problem that needs to be solved. This thesis studies the problem and proposes methods for normalization, attribute dependency extraction, and candidate key identification for topic-oriented semi-structured data tables. The main contributions are as follows:

(1) A framework for integrating semi-structured data tables. An overall framework for the integration method is proposed, and the whole data processing flow is described. The concepts of normalized table, non-normalized table, cell, attribute reduction, and difference function are formally defined.

(2) A standardization method for non-normalized tables. Based on the table definitions, a method is proposed for converting non-normalized tables, which do not conform to first normal form (1NF), into 1NF tables. Through a comprehensive analysis of the header characteristics of non-normalized tables, a header-based table normalization method is proposed. It comprises identifying non-normalized tables, structurally transforming non-normalized headers, and extracting the attribute dependencies contained in the header.

(3) A topic-oriented method for attribute dependency and candidate key identification. Inspired by attribute reduction algorithms for information systems in rough set theory, a method for identifying attribute dependencies and candidate keys based on attribute reduction with the difference function is proposed. From the nested header structure of non-normalized tables, the concepts of the kernel-like set and the non-candidate key set are introduced, and the difference-function-based attribute reduction algorithm is improved accordingly. The algorithm takes incrementally added data sets on the same topic and uses the kernel-like sets, non-candidate key sets, kernels, and difference functions of each table, together with the attributes of the topic, to jointly compute the candidate keys. Finally, all attribute dependencies of the topic are obtained from a data set of two-dimensional tables, and experiments demonstrate the feasibility of the method.
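The two core steps described above, flattening a nested header into 1NF attribute names and deriving candidate keys from a difference function, can be illustrated with a minimal sketch. It is not the thesis's improved algorithm: it does not use kernel-like sets, non-candidate key sets, or incremental data, and it searches attribute subsets by brute force. The function names (`flatten_header`, `difference_sets`, `candidate_keys`), the tree encoding of the header, and the sample table are all assumptions made for illustration. The difference function here is taken to be what rough-set literature commonly calls the discernibility function: for each pair of rows, the set of attributes on which they differ.

```python
from itertools import combinations


def flatten_header(header, prefix=()):
    """Flatten a nested header into single-level (1NF) attribute names.

    Assumed encoding: a header is a list whose items are either a plain
    string (leaf attribute) or a (label, children) pair for a nested group.
    """
    names = []
    for node in header:
        if isinstance(node, str):
            names.append(".".join(prefix + (node,)))
        else:
            label, children = node
            names.extend(flatten_header(children, prefix + (label,)))
    return names


def difference_sets(rows, attrs):
    """For every pair of rows, collect the attributes on which they differ."""
    return {
        frozenset(a for a in attrs if r1[a] != r2[a])
        for r1, r2 in combinations(rows, 2)
    }


def candidate_keys(rows, attrs):
    """Return the minimal attribute sets that distinguish every pair of rows.

    An attribute set distinguishes all rows exactly when it intersects
    every difference set; the minimal such sets are the candidate keys
    of this table instance.
    """
    diffs = difference_sets(rows, attrs)
    if frozenset() in diffs:
        return []  # two identical rows: no attribute set can be a key
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            chosen = frozenset(combo)
            if any(key <= chosen for key in keys):
                continue  # a superset of a known key is not minimal
            if all(chosen & d for d in diffs):
                keys.append(chosen)
    return keys


# Hypothetical table whose header nests "Name" and "Term" under "Course".
attrs = flatten_header(["Student", ("Course", ["Name", "Term"]), "Grade"])
values = [
    ("Ann", "DB", "2018", "A"),
    ("Ann", "OS", "2018", "B"),
    ("Bob", "DB", "2018", "A"),
    ("Bob", "OS", "2018", "A"),
]
rows = [dict(zip(attrs, v)) for v in values]
keys = candidate_keys(rows, attrs)
print(attrs)
print(keys)
```

In this sample, the only candidate key is the pair `Student` and `Course.Name`. A key found this way holds only for the rows seen so far, which is presumably why the abstract combines evidence from several tables on the same topic.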
Keywords/Search Tags: Semi-structured tables, First normal form, Rough set, Difference function, Kernel-like