
Research On Semantic Integration Techniques In Heterogeneous Databases

Posted on: 2006-04-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B H Qiang
GTID: 1118360155472592
Subject: Computer software and theory
Abstract/Summary:
With the rapid development and widespread application of information technology and the Internet, people's lives, work, and study have become more convenient. However, as the demand for comprehensive use of information grows, the shortcomings of the Internet are becoming obvious. Computer networks connect only the hardware; the data on different computers remain heterogeneous and form information islands, among which data sharing and interoperability are increasingly difficult. Limited data sharing, difficult data communication, and inconsistent formats are the main bottlenecks to comprehensive data usage. How to detect and resolve data conflicts and heterogeneity, and how to integrate heterogeneous databases so that data can be shared and used comprehensively, are therefore fundamental issues for new information technology applications. Database integration technologies provide effective means of detecting data heterogeneity, correcting data early, resolving data incompleteness and inconsistency, and ultimately improving data quality for comprehensive usage. Finding corresponding semantic objects, i.e. semantic integration, is the most important issue in heterogeneous database integration. Specifically, in the relational database domain the task of semantic integration is to find corresponding attributes and entities. Based on this research background and the characteristics of current semantic integration techniques for heterogeneous databases, the dissertation studies heterogeneous database semantic integration using neural networks, taking advantage of their self-learning and generalization abilities. The dissertation focuses on identifying corresponding semantic objects, i.e. attribute matching and entity matching. In addition, to estimate attribute weights accurately, attribute entropy and mutual information are used to calculate attribute weights when solving the entity matching problem. New algorithms for attribute matching and entity matching are proposed, and the experimental results show that they clearly improve precision and recall. The main contributions of the dissertation are summarized as follows:
① The main current issues in heterogeneous database integration are surveyed comprehensively; the task of heterogeneous database semantic integration, the types of semantic heterogeneity, and the approaches to resolving semantic heterogeneity are introduced in detail; and the problems of existing semantic integration techniques are examined. Finally, the feasibility of resolving the attribute and entity matching problems using neural networks, entropy, and mutual information is analyzed.
② The shortcomings of attribute matching based on BP (backpropagation) neural networks in the existing literature are analyzed. The dissertation points out, with illustration and testing, that the main factor reducing the accuracy of a neural network is that different inputs may produce the same output. The idea of establishing multiple classifiers is therefore proposed: the neural networks are trained several times on the same training data set with different initial connection weights and thresholds. The proposed multiple classifiers can filter out interfering data effectively. The effectiveness of this approach is tested in chapters three and five.
③ To address the existing problems of attribute matching in heterogeneous database semantic integration, a two-phase-check algorithm for attribute matching based on BP neural networks is presented, following the idea of establishing multiple classifiers. Attributes are first categorized by data type; the BP neural network architecture is then defined according to the characteristics of each category; and the BP neural networks are trained several times, separately for each category, with different initial connection weights and thresholds. During attribute matching, the attribute characteristic vector is fed into the corresponding networks, and the final matching result is the intersection of the results produced by the individual networks. The experimental results show that the proposed approach clearly improves attribute matching accuracy and reduces training time.
④ To address the shortcomings of weight assignment in entity matching during heterogeneous database semantic integration, the dissertation proposes an approach to computing attribute weights based on attribute entropy, a decision model for entity matching, and a matching algorithm for heterogeneous entities based on attribute entropy. The approach makes good use of the information in attribute instance values, which is objective and easily quantified. Experimental results on real-world data indicate that the approach achieves high accuracy. However, when weights are computed directly from attribute entropy, different attributes with the same entropy receive the same weight, which cannot effectively differentiate the importance of attributes. Mutual information between attributes is therefore used together with attribute entropy to compute attribute weights, and an algorithm for computing the final entropy of an attribute is also proposed. The precision and recall of entity matching are thereby further improved.
⑤ Besides the attribute entropy and mutual information approaches used to overcome the difficulty of computing attribute weights, the dissertation introduces the BP neural network to entity matching. A BP neural network can identify the corresponding attributes by analyzing the inner relationships among attributes through its self-learning ability, and thus avoids computing attribute weights directly. First, the factors that interfere with the performance and accuracy of neural networks are analyzed on a practical classification problem. Then an entity matching algorithm, a modified algorithm, and a two-phase-check entity matching algorithm, all based on BP neural networks, are proposed in turn. The experimental results show that these approaches are very effective. In particular, the two-phase-check algorithm for entity matching effectively resolves the problem that different inputs to a neural network may produce the same output, avoids the resulting erroneous matches, and further improves the accuracy of entity matching.
Finally, the research in the dissertation is summarized and future work is presented.
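As a rough illustration of the entropy-based weighting described in contribution ④, the following Python sketch computes the Shannon entropy of each attribute from its instance values and the mutual information between pairs of attributes. The abstract does not give the dissertation's formulas, so everything here is an assumption for illustration only: the toy relation and its attribute names are hypothetical, and normalizing entropies so that the weights sum to 1 is a stand-in scheme, not the author's algorithm. How mutual information is folded into a "final entropy" is not stated in the abstract and is deliberately left out.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the observed instance values of one attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired instance values."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def entropy_weights(table):
    """Assumed scheme: weight each attribute by its share of the total entropy."""
    h = {name: entropy(col) for name, col in table.items()}
    total = sum(h.values())
    if total == 0:
        # Every attribute is constant; fall back to uniform weights.
        return {name: 1 / len(h) for name in h}
    return {name: v / total for name, v in h.items()}


# Hypothetical toy relation: attribute name -> list of instance values.
table = {
    "gender": ["m", "f", "m", "f"],
    "dept": ["cs", "cs", "ee", "ee"],
    "building": ["A", "A", "B", "B"],
    "emp_id": ["1", "2", "3", "4"],
}

weights = entropy_weights(table)
for name, w in weights.items():
    print(f"{name}: weight={w:.2f}")

# "gender" and "dept" have the same entropy and so the same weight,
# yet their mutual information with "building" differs.
print(mutual_information(table["dept"], table["building"]))
print(mutual_information(table["gender"], table["building"]))
```

In this toy data, "gender" and "dept" receive identical entropy-based weights even though only "dept" shares information with "building". That is the limitation the abstract attributes to pure entropy weighting, and it is the reason mutual information is brought in.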
Keywords/Search Tags: heterogeneous database semantic integration, attribute matching, entity matching, BP neural network, two-phase-check algorithm, entropy, mutual information