Research And Implementation Of Density-Based Clustering Algorithm Concerning Vector Direction

Posted on: 2010-03-08  Degree: Master  Type: Thesis
Country: China  Candidate: J W Wu  Full Text: PDF
GTID: 2178360272995833  Subject: Computer software and theory
Abstract/Summary:
As more and more information pours into people's daily lives with the development of computer and database technology, people have begun trying to find the useful information within such large quantities of data. Data mining is the process of extracting hidden patterns from large amounts of data. As more data is gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming this data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

Clustering is an important analytical tool in data mining. It resembles classification, but the groups are not predefined, so the algorithm tries to group similar items together. Clustering is the assignment of objects to groups (called clusters) so that objects in the same cluster are more similar to each other than to objects in different clusters. Similarity is often assessed with a distance measure. Clustering is a common technique for statistical data analysis and is used in many fields, including machine learning, data mining, pattern recognition and image analysis.

Density-based clustering is a family of clustering methods designed to handle very large databases while still producing good clusterings. DBSCAN, a representative density-based algorithm, has two parameters: eps and MinPts. To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p with respect to eps and MinPts. The edge of a cluster is the set of objects lying between high-density and low-density regions. However, these density-based clustering algorithms have shortcomings: because they use uniform parameters for the whole database, they produce poor clusterings on asymmetrically distributed databases and do not perform well on high-dimensional databases.
The main cause is that uniform parameters cannot handle cases where the distribution of the data varies widely across the database. In real applications, however, as databases grow and dimensionality increases, the data distribution becomes complicated. Irrelevant attributes in the database weaken the clustering tendency, the boundaries between clusters become blurred, and clusters may partially overlap. DBSCAN and other density-based algorithms cannot handle these problems, which leads to inaccurate clustering results and reduces the accuracy of the knowledge obtained.

According to Newtonian physics, there is a mutual effect between two particles, with force F=(?). A force has both magnitude and direction. We carry this vector view over to the effect among objects in a database: the mutual effect between objects depends on both distance and direction. Therefore, when considering the effect that other objects have on one object in the database, we should take two effects into account: the effect of scalar quantity, which is the effect of the distance between two objects, and the effect of vector quantity, which is the effect of the direction between two objects.

Based on the analysis above, we define the influence function of an object in a database:

    F(x); // General definition of the influence function of an object in the database
    {
        f(x); // Influence function of scalar quantity
        g(x); // Influence function of vector quantity
    }

The influence function of scalar quantity measures the influence between objects by their distance. All the density-based clustering algorithms mentioned in this paper use an influence function of scalar quantity to measure influence.
For example, in DBSCAN the influence function of scalar quantity is a square wave function.

By analyzing the limitations of existing density-based clustering algorithms, the vector influence between points, covering both distance and direction, is discussed. Definitions such as the scalar influence function and the vector influence function are introduced, together with an expression for the vector influence function and two methods for evaluating it: similarity and sum. The algorithm handles a core point by taking the projection of the points in its neighborhood to judge whether it is balanceable. Only balanceable core points can be expanded to form clusters. Theoretical analysis and experimental results indicate that the algorithm can discover clusters of arbitrary shape and can effectively eliminate noise such as boundary sparse points. It addresses the difficulties of clustering high-dimensional spatial data, such as the spatial distribution of the data, unclear boundaries between clusters, too many noise points, and the phenomenon that the difference between the distances to the nearest and farthest neighbors of a data point goes to zero. The algorithm improves clustering accuracy and offers better clustering results on various data sets. It executes effectively and efficiently. The choice and impact of the algorithm's parameters are also discussed. The algorithm is scalable and general.

Taking both accuracy and running time into account, both DVDC algorithms produce more accurate clustering results. Because the sum coefficient of the sum method must be initialized according to the actual situation, we suggest considering the similarity method first for a small-scale database, since it takes less time and achieves high accuracy with a satisfactory clustering result.
As the database grows, the sum method is suggested because it offers an optimized result with minimal time cost.

In future work we will investigate how to lower the time complexity of DVDC in order to reduce the time spent on judging and handling neighborhood balance. We will also discuss the proper choice of parameters in order to reduce their influence on the algorithm. Furthermore, we will study other ways to define the influence function of vector quantity among objects to improve extensibility.
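
The two kinds of influence described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's DVDC implementation: the abstract gives no formula for the vector influence function, so the balance test below (length of the mean unit vector from a point to its neighbors, compared with a hypothetical threshold `balance_tol`) is only one plausible reading of a sum-style check. The function names and the choice to exclude the point itself from its own neighborhood are likewise assumptions made for illustration.

```python
import math


def scalar_influence(p, q, eps):
    """Square-wave influence as in DBSCAN: 1 inside the eps-neighborhood, else 0."""
    return 1 if math.dist(p, q) <= eps else 0


def neighbors(points, i, eps):
    """Indices of all points within eps of points[i], excluding i itself (assumption)."""
    return [j for j, q in enumerate(points)
            if j != i and scalar_influence(points[i], q, eps) == 1]


def is_core(points, i, eps, min_pts):
    """Core test based on scalar influence only: enough neighbors within eps."""
    return len(neighbors(points, i, eps)) >= min_pts


def direction_imbalance(points, i, eps):
    """Length of the mean unit vector from points[i] to its neighbors.

    Hypothetical stand-in for a vector influence function: close to 0 when
    neighbors surround the point evenly, close to 1 when they all lie on one side.
    """
    p = points[i]
    nbrs = neighbors(points, i, eps)
    if not nbrs:
        return 0.0
    total = [0.0] * len(p)
    for j in nbrs:
        q = points[j]
        d = math.dist(p, q)
        if d == 0:
            continue  # a coincident point has no direction
        for k in range(len(p)):
            total[k] += (q[k] - p[k]) / d
    return math.hypot(*total) / len(nbrs)


def is_balanced_core(points, i, eps, min_pts, balance_tol=0.5):
    """A core point whose neighbors are spread evenly enough (balance_tol is hypothetical)."""
    return (is_core(points, i, eps, min_pts)
            and direction_imbalance(points, i, eps) <= balance_tol)


# Usage: a center point surrounded on four sides, and a point on the edge.
pts = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
print(is_balanced_core(pts, 0, eps=1.5, min_pts=3))  # True: neighbors on all sides
print(is_balanced_core(pts, 1, eps=1.5, min_pts=3))  # False: neighbors all on one side
```

In this sketch both points pass the scalar (DBSCAN-style) core test, but only the center point passes the direction check, which mirrors the abstract's statement that only balanceable core points are expanded and that boundary sparse points are filtered out.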
Keywords/Search Tags: influence function, direction similarity, boundary sparse points, DBSCAN