
Research on Integrative Mining Methods and Techniques for Hyperdata

Posted on: 2013-01-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Y Zhou
Full Text: PDF
GTID: 1118330371458959
Subject: Computer Science and Technology
Abstract/Summary:
Hyperdata refers to data objects that are linked to other data objects by semantic relations, together forming a web of data. For integrative mining applications, hyperdata is a well-suited data source because it provides plenty of richly structured linked data. However, the richly structured, distributed, and large-scale nature of hyperdata also raises several problems: its distributed nature causes data heterogeneity, its rich structure makes traditional data mining methods inapplicable, and its scale requires mining methods that scale. So far, research and applications in the hyperdata community lack systematic, effective approaches for addressing these problems and for supporting the development of integrative mining on hyperdata.

Against this background, this thesis focuses on integrative mining on hyperdata. It covers hyperdata preparation, hyperdata integrative mining, and the use of cloud computing techniques to support integrative mining on large-scale hyperdata, and it proposes techniques and solutions for the problems mentioned above. The main contents and contributions of this thesis can be summarized as follows:

Part 1: Hyperdata preparation, which includes two parts: hyperdata acquisition and hyperdata integration

(1) Hyperdata acquisition: a domain-ontology-based method for automatic hyperdata acquisition from textual documents

To transform Web data (especially text documents) into hyperdata, this thesis first proposes a domain-ontology-based method for automatic hyperdata acquisition from text. Hyperdata is linked to other hyperdata along multiple dimensions and by complex semantic relations, which together form a hyperdata graph. A sentence is the basic unit of text; it may contain several hyperdata nodes, and multiple types of semantic relations may exist among these nodes.
The method therefore uses the hyperdata graph as the structure for expressing the hyperdata embedded in text, and it draws on natural language processing, data mining, and statistical theory to acquire hyperdata from text.

(2) Hyperdata integration: a semantics-based mashup method for multiple hyperdata sources

The higher the data quality, the better data mining generally performs. Data cleaning is therefore one of the most important preparation tasks, and data integration is one of its important steps. The distributed nature of hyperdata brings two integration problems: data schema heterogeneity and data content heterogeneity. To solve them, this thesis proposes a semantics-based mashup method that integrates multiple hyperdata sources into a single high-quality data source free of inconsistency and redundancy. For data schema heterogeneity, the method uses semantic mapping to map multiple data schemas to a universal ontology schema. For data content heterogeneity, it uses a hybrid approach that combines logical reasoning and text mining to identify hyperdata nodes that have different identifiers but refer to the same real-world entity.

Part 2: Hyperdata integrative mining, which includes concept description and mining methods

(3) Concept description on hyperdata: a semantic-graph-template-based concept description method

Hyperdata adopts RDF as its data description language. RDF data is richly structured and cannot be processed directly by traditional data mining methods, especially machine learning methods. The concept description step describes the relevant concepts and gives their comparative description, based on the data schema and the mining methods.
For the hyperdata schema, this thesis presents a semantic-graph-template-based concept description method. It uses a semantic graph template to describe the information sources in RDF data, including descriptive properties, semantic relations, and semantic graph structure, and thereby implements concept description for integrative mining on hyperdata.

(4) Hyperdata mining method: a probabilistic semantic learning model

Hyperdata adopts RDF as its data description language and SPARQL as its query language. Unlike data in other schemas, it is richly structured and distributed, which impedes integrative mining. To address these properties, this thesis proposes a probabilistic semantic learning model that uses the semantic graph template to describe the relevant concepts and give their comparative description. Based on the semantic graph template, the model extends traditional Bayesian network learning to perform integrative mining on linked hyperdata. The thesis also proposes a semi-supervised learning method that improves the performance of probabilistic semantic learning when the training data is insufficient or of poor quality.

Part 3: Scalability of the mining methods

(5) A hyperdata integrative mining prototype system based on cloud computing infrastructure

For integrative mining and analysis on large-scale hyperdata, this thesis proposes a prototype system for hyperdata integrative mining built on cloud computing infrastructure.
This system contains three modules: storage of large-scale hyperdata, SPARQL querying over large-scale hyperdata, and probabilistic semantic learning on large-scale hyperdata.

This thesis centers on integrative mining on hyperdata and develops methods and techniques for the problems that arise in hyperdata preparation, integrative mining methods, and the prototype system. First, it proposes a domain-ontology-based method for automatic hyperdata acquisition from text, which transforms Web data into hyperdata. Second, it proposes a semantics-based mashup method for integrating heterogeneous hyperdata from multiple sources, covering data schema integration and hyperdata entity identification. These two methods form the first part of the thesis, hyperdata preparation. The second part, integrative mining methods for hyperdata, consists of a semantic-graph-template-based concept description method, a probabilistic semantic learning model for integrative mining on hyperdata, and a semi-supervised learning method that improves the performance of probabilistic semantic learning when the training data is insufficient or of poor quality. The third part, the prototype system, aims to extend the scalability of integrative mining on hyperdata. It uses cloud computing infrastructure to store large-scale hyperdata, to implement SPARQL queries with the MapReduce model in the Hadoop framework, and to perform probabilistic semantic learning on large-scale hyperdata.

The methods and techniques presented in this thesis aim to solve the problems that arise in the stages of integrative mining on hyperdata, including hyperdata acquisition, integration, concept description, and integrative mining methods, which are caused by the richly structured, distributed, and large-scale nature of hyperdata.
In addition, to cope with the large scale of hyperdata, this thesis presents a prototype system for large-scale hyperdata integrative mining based on cloud computing infrastructure. It consists of storage, SPARQL querying, and integrative mining on large-scale hyperdata, all implemented using MapReduce and Hadoop. The methods and techniques presented in this thesis lay a theoretical and technical foundation for further research on integrative mining on linked hyperdata.
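
The abstract does not describe how the prototype evaluates SPARQL queries with MapReduce, so the following is only a rough, generic sketch of the idea: a two-pattern SPARQL query answered by a reduce-side join over RDF triples. It simulates the map, shuffle, and reduce phases in plain Python rather than on Hadoop. The triples, the `ex:` names, and all function names are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict

# Toy RDF triples (subject, predicate, object). All identifiers are made up.
TRIPLES = [
    ("ex:alice", "ex:worksAt", "ex:zju"),
    ("ex:bob", "ex:worksAt", "ex:mit"),
    ("ex:zju", "ex:locatedIn", "ex:China"),
    ("ex:mit", "ex:locatedIn", "ex:USA"),
    ("ex:alice", "ex:knows", "ex:bob"),
]

# Query being illustrated:
#   SELECT ?person ?country WHERE {
#     ?person ex:worksAt ?org .
#     ?org ex:locatedIn ?country .
#   }


def map_phase(triple):
    """Emit (join key, tagged value) for triples matching either pattern."""
    s, p, o = triple
    if p == "ex:worksAt":
        yield o, ("person", s)   # ?org is the object of the first pattern
    elif p == "ex:locatedIn":
        yield s, ("country", o)  # ?org is the subject of the second pattern


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(org, values):
    """Join the two sides that share the same ?org binding."""
    people = [v for tag, v in values if tag == "person"]
    countries = [v for tag, v in values if tag == "country"]
    for person in people:
        for country in countries:
            yield person, country


def run_query(triples):
    """Run map, shuffle, and reduce in sequence and return sorted bindings."""
    pairs = (pair for t in triples for pair in map_phase(t))
    groups = shuffle(pairs)
    rows = [row for org, vals in groups.items() for row in reduce_phase(org, vals)]
    return sorted(rows)


if __name__ == "__main__":
    for person, country in run_query(TRIPLES):
        print(person, country)
```

Keying the mapper output on the shared variable `?org` is what lets each reducer join its group independently of the others, which is the property that allows this kind of query to be spread across many machines.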
Keywords/Search Tags:Hyperdata, Semantic Web, Semantic Relation, Data Mining, Data Integration, Integrative Mining, Machine Learning, Semi-supervised Learning, Cloud Computing