Research On Some Key Problems Of Data Integration In Dataspace

Posted on: 2015-08-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Jiang
GTID: 1228330467950238
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
In recent years, with the development of digital technology, the amount of data has grown rapidly and the nature of data has become increasingly diverse. Modern network technologies tightly connect data held at great distances from one another. Traditional data management technologies have contributed greatly to data management over the past decades. However, these new characteristics expose the shortcomings of traditional technologies in handling complicated and ever-changing data. Under these circumstances, researchers have turned to a novel technology, the dataspace, to meet the expanding requirements of data management.

Dataspace is a newly emerging area in which many research questions remain open. This thesis focuses on the key problems of data integration in dataspace. Data integration aims to solve the problem of sharing and managing heterogeneous, distributed data. Dataspace shares this goal but differs greatly in the objects integrated, the integration approach and the integration technologies. A dataspace uses wrappers to extract information from data sources. First, the correlation between the extracted information and the subject of the dataspace must be evaluated to decide whether to store the information in the dataspace. Next, because a dataspace also stores relationships among data, relationships among the more highly correlated information are discovered and refined for integration. Finally, the data schema is verified by matching it against the dataspace schema summary, and the information and relationships that satisfy the correlation requirement are stored in the dataspace. The key problems of data integration in dataspace are therefore correlation evaluation, data relationship discovery and schema summary creation.

Based on the above analysis, this thesis makes the following contributions.
1) A dataspace is constructed around a subject and can be controlled by that subject. All data within the dataspace is correlated with the subject to some extent, so it is important to evaluate the correlation between the data and the subject. This thesis studies PFC, an algorithm that calculates correlation based on operation behaviors. The user's operation behaviors are collected, analyzed and stored by an operation-behavior collection algorithm and organized as an information set using the Vertical model. The thesis also studies the extraction of the core-word set and gives methods for evaluating the relationships among operation behaviors and among the information accessed by the user. Combining these relationships with the frequency of core words, it proposes a CTFS-based method for evaluating core-word weight, which is used to extract the core-word set. Building on the core-word set, together with path length, occurrence frequency and semantic content, it proposes the PFC data correlation algorithm and a data quality evaluation method that serve querying and sorting in dataspace. Experiments report the results of core-word extraction, data correlation and data quality, and demonstrate the efficiency and usability of these algorithms.

2) Most traditional data management technologies store only basic data information and do not consider deeper information. In contrast, a dataspace stores not only the content of data but also the relationships among data, so discovering such relationships has become an important topic in dataspace research. This thesis studies the discovery of implicit relationships on the basis of subject characteristics. It divides the relationships among data atoms in a dataspace into dominant relationships and implicit relationships. It first addresses dominant relationships and then introduces subject characteristics to find implicit relationships.
In the dominant relationship part, this thesis first describes a data atom using a 5-ary vector and gives approaches for measuring the importance of data atom attributes, so that important attributes can be extracted as core words. It then proposes the CWD model and defines the set of data atoms containing the same core word, which is used to extract data atom relationships. It also defines group categories and their relationships and combines them with these data atom sets to give an approach for finding dominant relationships based on data atom sets, group categories and group category relationships (DCR). In the implicit relationship part, the thesis first extends the definitions of support and confidence based on subject characteristics. Then, building on the dominant relationships, it gives approaches for finding implicit relationships among data atoms based on the frequent itemsets generated with this subject-based support and confidence. Finally, experiments show that changes in attributes and group category relationships, as well as the frequent itemsets based on subject characteristics, have an obvious effect on data atom relationships.

3) Users mainly come to understand a system's structure by grasping its schema information. However, a database schema is usually very complicated, and users need considerable time and effort to understand it. A dataspace weakens the schema, so users need even more time to understand the dataspace when accessing it. It is therefore very important to build a schema summary for a dataspace. This thesis proposes an approach for extracting the schema summary of a dataspace based on information difference, which helps users understand the structure of the dataspace faster and more accurately.
It first uses the PageRank algorithm to calculate node importance and then gives an approach for selecting preferred nodes, which considers how node importance is affected by both the connectivity of nodes in the schema graph and the frequency distribution in the data graph. It then uses the preference value of information difference, calculating and analyzing the information change difference (ICD) generated by the nodes, to determine the user's access trends and interests, which provides schema summary extraction with a candidate set of nodes the user is interested in. After that, it analyzes the characteristics of schema partitioning in dataspace and combines schema partitioning with association construction, using the edge-betweenness-based schema partition algorithm SPIP to partition the nodes of the schema graph. Moreover, it uses a modularity (module degree) function to measure partition quality and presents the entire process of extracting the schema summary. One experiment compares the traditional greedy algorithm with the partition algorithm above and shows that the latter performs better in both efficiency and accuracy. The other compares query efficiency under three conditions, with and without the schema summary, and the results show that using the schema summary improves the efficiency and reduces the cost of queries.

In summary, this thesis further studies the core problems of data correlation, relationship discovery and schema summary extraction in dataspace integration. For the correlation between data and subject, it proposes the PFC algorithm, based on the analysis of the user's operation behavior. For relationship discovery, it gives approaches for finding dominant relationships based on DCR and implicit relationships based on subject characteristics in dataspace.
Moreover, to address the shortcomings of dataspace, namely its weak schema, dynamic subject and data, and difficulty in schema matching, it proposes approaches for extracting the schema summary based on information difference, which improve the matching of data information against the dataspace and help locate user requests and queries accurately.
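As an illustration of the node-importance step mentioned under contribution 3, the following is a minimal sketch of standard PageRank computed by power iteration over a small schema graph. It is not the thesis's method: the abstract does not specify how connectivity is combined with the frequency distribution in the data graph, so that part is omitted. The function name `pagerank`, the damping factor 0.85 and the example table names are all assumptions made for this sketch.

```python
def pagerank(graph, damping=0.85, iterations=100, tol=1e-10):
    """Standard PageRank by power iteration (illustrative only).

    graph maps each node to the list of nodes it links to.
    """
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in nodes
        }
        # Each node passes an equal share of its rank to every target.
        for v in nodes:
            targets = graph.get(v, [])
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical schema graph: nodes are tables, edges are references.
schema_graph = {
    "Order": ["Customer", "Product"],
    "Customer": ["Address"],
    "Product": ["Supplier"],
    "Supplier": ["Address"],
    "Address": [],
}
scores = pagerank(schema_graph)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the most referenced table ranks first and the table that nothing references ranks last, which is the kind of ordering a schema summary could use when choosing candidate nodes.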
Keywords/Search Tags: Dataspace, Correlation, Information Extracting, Relationship Discovery, Schema Summary