Font Size: a A A

Clustering Algorithm Based On Graph Theory And Its Application In Genetic Data Processing

Posted on:2008-12-05Degree:MasterType:Thesis
Country:ChinaCandidate:C J SunFull Text:PDF
GTID:2178360212497045Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the rapid development of biotechnology and its increasingly widerange of potential applications. Gene chip technology enables more andmore to pay attention to combine computer technology and biologicaltechnology in a new field of bioinformatics. Gene chips have greatlyaccelerated the understanding of biological information, but the resultingmass of data. Now the key is how to effectively analyze data from theexcavation meaningful information. To conduct large-scale gene expressiondataanalysis,dataanalysisalgorithmsandcorrespondingsoftwareisneeded.The research based on new clustering methods, with the aim of solving theproblems that exist in previous methods has improved efficiency, theefficiency of expression profile data clustering algorithm, the accuracy andpracticality and the capacity and efficiency of existing microarray dataanalysismethods.The object of Clustering is composed of several model, usually, themodel is a vector, or a point of multidimensional space. Clustering analysisis based on Similarity, a cluster model in the same cluster is more similaritythan not. To measure the similarity and close relationship between theclassification requires some definition of classified statistics. for theclassification of quantitative indicators, which can be used to classify. Theseparation distance is a commonly used statistic similarity coefficients, andtheir definition is based on the types of cluster analysis. the evaluationcriteria of Clustering include : scalability, the ability to deal with differenttypes of fields and found clusters of arbitrary shape, dependent on the inputparameters weak knowledge in the field, deal with abnormal data, not sensitive to the order entered into the record, The ability to deal with highdimensional data,theabilitytoincreaserestrictions afterthe clusteranalysis,explanation and availability. The mainly Algorithms include drawn level,density,meshandmodel.Thereisanewalgorithminthispaper,thisisbasedonthespanwhichisbased on graph theory. Graph Theory is commonly used data structures inthe computer algorithm, and it has a set of complete theoretical system.Which include manyof the definitions, border, apex, the weight, neighbors,complete graph, Central, path, etc., some related theorem. We get newdefinition which based on these theories. Following the discussion and saidhere arenoplans, simpleandthe wright graph.Themaximum weight of thepath is called supporting value of this path,and the side is called bridge ofthe path. Define a path which have the minimum supporting value, as theclosest path. The span is the supporting value of the closest path. The spanof a graph is defined by a span of two vertices. The span of a graph is thelargest span value of all the span between the vertices of the graph. In thesame time Giving simple nature of span, sector, symmetry, and uniqueness.Inviewofspan'sdefinitionandbasicnaturetherearesomesimpletheorem:The wight between the two ends of the smallest right side is the span of thetwo ends; For any vertices, there must is a vertices, the two vertices'spanequivalenttothespanofthegrapha; Ifthespanoftwovertices isequivalenttothespanofthegraph,andthereisaverticestowhichthespanoflessthanthespanofthegraph,mustthisisequivalenttoanotherverticesonthespan;Given a vertex, the graph is divided into two complementary set by span ifwhich equivalent the span of the graph, and the cover of the two sets is empty. These two sets are called pools. The inherent nature is given by thepools definition. There are two concepts -- the inverse symmetry span oftwo minutes -- and clearly described. Two clearly refers to the pools isn'tchanged, and no change about a given point, and differences In fact, Allspanbetweentheverticesofagraphcanbeexpressedasasymmetricmatrixwhich is called matrix span. Matrix's value is the span between two points.The authors give a algorithm which comput the span matrix by adjacencymatrix, the value of the whole matrix is adjacent to the order followed bysmall to large, determine whether each side is a span between tow vertices.Thisistheconceptonsomeissuesrelatedtothespan.Using concept of span above given, we design the clustering algorithmbased on the span. The algorithm first calculated span matrix according tothe adjacency matrix, then calculated the pools according the span matrix.For each categoryuse the above steps, until can't change. Can't change, wehave established a principle, assuming that no isolated spots. This shouldminimize the value of each point of maximum span as a standard. It'sdifference about hierarchical clustering method is only in a fewcircumstances we need too many steps. Under the normal need one time toscatters the set, end the clustering process. the parameters of this Clusteringmethod is the smallest size of the cluster, the process is stoped when thecluster size is smaller than the smallest division results. The algorithm canfound clusters of arbitrary shape, and to deal with abnormal data, and theorder is not infact the result. The flexible and handle different types of fieldcapacity, a weak dependence about the parameters knowledge in the field.Algorithm can't well for the high-dimensional data, and the cluster analysis capbilityafter increased restrictions, Availabilityand interpretative nature isnotgood.Spacecomplexityis O ( n 2),thetimecomplexity O ( n 2log n ).Finally, we use a group of childhood acute lymphoblastic leukemiapatients in the gene as samples for testing algorithms. After datapreprocessing and the dimensional reduction, clustering subtype is veryclosetotheactual results,shows that thealgorithm has somepractical value.ComparingwiththeresultsobtainedbyuseofK-meansclusteringalgorithmfordataprocessing,alsohasgoodresults.
Keywords/Search Tags:Application
PDF Full Text Request
Related items