
Research On Multi-relational Data Clustering Algorithm And Its Application

Posted on: 2015-10-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Cheng
GTID: 1318330518472864
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, large amounts of data are generated in practical applications across all walks of life. Data mining, an important step in knowledge discovery, has attracted wide attention as a way to find useful information and knowledge in these data. Clustering analysis, one data mining method, has likewise become an active research topic. Clustering is an unsupervised machine learning technique that groups data into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. It can be used to discover the internal structure of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis.

Most clustering algorithms apply only to data in a single table, whereas in many practical applications most structured data are stored in multiple tables of a relational database. The relational tables can be merged into a single table through join or aggregation operations, but this produces high-dimensional data, and after consolidation the data may be distributed in subspaces of different dimensions. Data objects located in different dimensions can then appear equidistant, which makes the distance measure meaningless; it is also difficult to reflect the effect of relationships between tables on the clustering. Multi-relational data clustering arose in response to these application requirements. However, existing research on multi-relational clustering algorithms has not given effective solutions to several problems: one-to-many relationships exist between objects, incomplete corresponding information means that target objects may be described by information of different orders, and loops resulting from relationships between tables may cause information to be reused.

In addition, a complete clustering analysis does not end with the clustering itself. The quality of the clustering results still has to be evaluated by examining whether they correspond to the internal distribution characteristics of the data, that is, by verifying the validity of the results. The results also have to be analyzed and interpreted with reasonable and effective methods, to help analysts make decisions and identify new points of analysis. Consequently, this thesis addresses the main problems in multi-relational data clustering algorithms and in methods for evaluating and interpreting clustering results, through the following research work.

(1) An effective multi-relational clustering algorithm is proposed for two problems in multi-relational data clustering: extracting the information corresponding to one-to-many relationships with statistical methods ignores the primitive characteristics of the data, and loops resulting from relationships between tables may cause information to be reused. These problems stem from the different kinds of relationships between tables in the data set, and the relationships in the IDEF1x model can explain their causes, so a hierarchical framework for multi-relational clustering is built on the IDEF1x model. The thesis then studies how the different types of relationships in this framework affect the transfer of clustering results, and how to integrate the clustering results of multiple child nodes. On this basis a new multi-relational data clustering algorithm is proposed, with the ultimate purpose of effectively assisting the clustering of target objects.

(2) For the problem that target objects in multi-relational clustering may be described by information of different orders, a multi-relational clustering algorithm that does not lose data information is studied. The framework of this algorithm is still the relational hierarchy based on the IDEF1x model, and target objects described by incomplete information are regarded as uncertain data. First, a multi-relational uncertain data model based on the Kripke structure is built to depict the integrity of the data description information. The uncertainty of the data is then described further using probabilistically constrained regions, and a distance measure between uncertain data is defined. Finally, a multi-relational clustering algorithm based on probabilistically constrained regions is proposed, which can ensure the validity of multi-relational clustering without destroying the primitive characteristics of the data.

(3) Traditional clustering evaluation methods mostly judge the validity of clustering results from the values of evaluation indices, and they all have certain limitations. Focusing instead on the clustering process, the thesis proposes that a clustering process leading to valid results should satisfy certain state properties, and it abstracts and models the clustering process with program graphs and transition systems. Judging clustering validity is thereby transformed into verifying, with a model checking algorithm, whether the model of the clustering process satisfies the specified properties. The algorithm not only concludes directly whether the clustering results are valid; if they are invalid, it can also identify the iterations that affect validity by analyzing the counterexample. In this way the study tries to build a bridge between clustering and model checking.

(4) Common methods for interpreting clustering results, such as analyzing the distribution characteristics of attribute values or the distribution of the data, lack a quantitative measure of the differences between clusters on each attribute. Yet these differences reflect how strongly each attribute affects the clustering results, and can further help analyze the significance of attributes for the clusters. Accordingly, a clustering result analysis algorithm is proposed based on the idea of one-way analysis of variance. It first compares the intra-cluster and inter-cluster differences for every attribute, and then defines a measure of the degree of impact that single attributes and correlated attributes have on the clusters. This degree of impact can be regarded as the impact factor of the attributes.

Finally, the work of the dissertation is summarized and directions for further research are put forward.
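
The one-way analysis-of-variance idea in item (4) can be illustrated with a minimal sketch. It computes the textbook one-way ANOVA F statistic (between-cluster mean square divided by within-cluster mean square) for each attribute separately, given the cluster labels. This is only an assumption about what a per-attribute impact measure could look like. The abstract does not give the thesis's actual definition of the impact factor, and the handling of correlated attributes is not covered here. The function name `attribute_f_scores` and the toy data are hypothetical.

```python
from collections import defaultdict


def attribute_f_scores(rows, labels):
    """Return one one-way ANOVA F statistic per attribute (column).

    rows   -- list of equal-length numeric tuples, one per object
    labels -- cluster label of each object, same order as rows
    (Hypothetical helper; not the thesis's own impact-factor definition.)
    """
    n = len(rows)
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    k = len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more objects than clusters")

    scores = []
    for j in range(len(rows[0])):
        grand_mean = sum(row[j] for row in rows) / n
        ss_between = 0.0  # spread of cluster means around the grand mean
        ss_within = 0.0   # spread of objects around their own cluster mean
        for members in groups.values():
            values = [row[j] for row in members]
            mean = sum(values) / len(values)
            ss_between += len(values) * (mean - grand_mean) ** 2
            ss_within += sum((v - mean) ** 2 for v in values)
        ms_between = ss_between / (k - 1)
        ms_within = ss_within / (n - k)
        if ms_within == 0:
            # Clusters are internally constant on this attribute.
            scores.append(float("inf") if ms_between > 0 else 0.0)
        else:
            scores.append(ms_between / ms_within)
    return scores


# Toy data: attribute 0 separates the two clusters, attribute 1 does not.
rows = [(1.0, 5.0), (2.0, 1.0), (9.0, 5.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = attribute_f_scores(rows, labels)
print(scores)  # [128.0, 0.0]
```

In this toy example the first attribute gets a large score because the cluster means differ far more than the values inside each cluster, while the second attribute scores zero because both clusters have the same mean on it. A larger score suggests that the attribute contributes more to separating the clusters.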
Keywords/Search Tags: multi-relational clustering, IDEF1x model, result evaluation, computation tree logic, analysis of variance