Research And Application Of Data Mining Based On Graph Structure

Posted on: 2014-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: L L Zhou
Full Text: PDF
GTID: 2268330392973027
Subject: Computer application technology
Abstract/Summary:
This thesis studies data mining on graph structures: organizing data into a graph-form data structure in order to extract useful information. Current mainstream data mining algorithms mostly handle vector data. In research areas such as the EPC System Network and social networks, however, a vector-only representation causes attribute submergence: because data drawn from the real world are naturally structured, vectors cannot effectively express the relationships between attributes. A graph structure avoids this submergence, expresses the correlations between data attributes fully, and carries richer additional information than a vector. How to organize such data into a graph structure and operate on it efficiently has therefore become a new hotspot in data mining. Frequent subgraph query and graph classification are the core of graph data mining and the basis for other research on graph data.

Through a discussion of subgraph query and classification algorithms, this thesis improves the code representation so that queries on ordinary graphs extend to directed graphs; introduces a sampling strategy into graph classification, which improves the efficiency of building the classification model; and applies graph mining to bioinformatics. The major work is as follows:

1. The proposed algorithm, DFSS, improves the applicability of gSpan. Its graph encoding technique differs from that of traditional algorithms such as FSG, FFSM, and AGM. It proposes the concepts of level degree and connection degree, which extend the scope of application to directed graphs. So far, frequent subgraph mining has mostly addressed knowledge discovery in undirected graphs, while mining of directed graphs remains rare. Compared with Apriori-based algorithms such as FSG and AGM, the designed algorithm improves to some extent in time complexity and mining efficiency. Experimental results show that, without loss of mining completeness, its efficiency is 70-80 times that of the FFSM algorithm.

2. Traditional graph classification algorithms, such as FSG and CEP, are very inefficient when a low support threshold is selected, while a high threshold lowers classification precision because important patterns are lost. To solve these problems, a sampling-based learning strategy is introduced into graph classification, and the concept of average degree is proposed. Representative sub-patterns are selected by computing the average vertex degree of the samples while maintaining classification accuracy, and this is combined with the frequent closed emerging patterns given by CEP to design a new method for extracting graph features (classification rules). The method resolves the case in which the CEP algorithm cannot complete its computation because the support threshold is too low, and it greatly improves classification efficiency. Experiments show that the proposed algorithm outperforms some existing mainstream algorithms.

3. A method for recognizing repetitive DNA sequences based on frequent subtree mining is proposed, which avoids sequence alignment. The algorithm organizes the sequence as a suffix tree, simplifies and improves the tree to make it better suited to subtree mining, and then applies frequent subtree mining, avoiding the time wasted on stitching short repeats together. The designed "two-step identification technology" also recognizes fuzzy repeats well and improves recognition completeness. Experiments show that the algorithm performs better in recognition efficiency, with a more pronounced advantage for long repeats, and likewise in recognition completeness.
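
The idea behind the third item, finding repeats from the suffixes of a sequence instead of by alignment, can be sketched in a few lines. This is only an illustration under stated assumptions, not the thesis's method: it swaps in a naive scan over sorted suffixes (a suffix array in spirit) in place of the simplified suffix tree and frequent subtree mining, and it handles exact repeats only, not fuzzy ones. The names `exact_repeats`, `occurrences`, `common_prefix_len`, the parameter `min_len`, and the sample sequence are all hypothetical.

```python
from typing import Dict, List


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def occurrences(seq: str, sub: str) -> List[int]:
    """All start positions of sub in seq, overlapping matches included."""
    found = []
    pos = seq.find(sub)
    while pos != -1:
        found.append(pos)
        pos = seq.find(sub, pos + 1)
    return found


def exact_repeats(seq: str, min_len: int) -> Dict[str, List[int]]:
    """Report substrings shared by lexicographically adjacent suffixes.

    Assumption: min_len >= 1. Sorting whole suffixes is quadratic in memory
    and time, so this is suitable for short illustrative inputs only.
    """
    # Suffixes that share a long prefix end up next to each other once sorted.
    order = sorted(range(len(seq)), key=lambda i: seq[i:])
    repeats: Dict[str, List[int]] = {}
    for i, j in zip(order, order[1:]):
        k = common_prefix_len(seq[i:], seq[j:])
        if k >= min_len:
            sub = seq[i:i + k]
            if sub not in repeats:
                repeats[sub] = occurrences(seq, sub)
    return repeats


# Hypothetical sample sequence, chosen only for illustration.
for sub, positions in exact_repeats("ACGTACGTTTACG", min_len=3).items():
    print(sub, positions)
```

For the sample sequence, the sketch reports ACG at positions 0, 4, and 10, ACGT at 0 and 4, CGT at 1 and 5, and TACG at 3 and 9. No pairwise alignment is performed; repeats fall out of the shared prefixes of neighboring suffixes, which is the property a suffix tree exposes directly through its internal nodes.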
Keywords/Search Tags: Graph data mining, Directed sub-graph, Graph classification, Sampling strategy, DNA repeat recognition