Research And Improvement For Semi-supervised K-means Clustering Algorithm In Data Mining

Posted on: 2011-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: F Liu
Full Text: PDF
GTID: 2178360305954379
Subject: Bioinformatics
Abstract/Summary:
Data mining is an important subject in current research on machine learning, pattern recognition, computer science, intelligent computing, applied mathematics, statistical learning, and intelligent robotics. Data mining techniques are applied in databases, statistics, optimization, artificial intelligence, pattern recognition, parallel computing, machine learning, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. With the rapid development of modern computer, information, and communication technology, how to analyze and refine the available data and extract from it implicit, previously unknown, novel, and potentially useful knowledge for decision-making has become a problem that urgently needs to be addressed.

This thesis focuses on the cluster analysis problem in data mining and studies both algorithms and their applications. To improve the efficiency of the traditional K-means clustering algorithm, it presents an improved K-means clustering algorithm that selects the initial cluster centers by data segmentation. This algorithm is applied to statistics on the basic household income and expenditure of urban residents in the various regions of China and achieves good results. Combining K-means with semi-supervised learning, the thesis proposes a semi-supervised K-means clustering algorithm and, for the choice of the initial cluster centers, an improved semi-supervised K-means clustering algorithm. The latter is applied to statistical data on the height and weight of men and women in China and obtains better results.

The main contributions and research findings of this thesis are as follows:

1. 
Provide an overview of data mining research. The thesis introduces and summarizes the significance, main content, and applications of data mining, discusses its current problems, and points out directions for future research and development. Data mining is an interdisciplinary technology that arose in the 1980s. The current state of data mining capabilities and products is the result of the combined influence of several disciplines: databases, information retrieval, statistics, algorithms, and machine learning. With the rapid development of modern information, communication, and computer technology, the scope, depth, and scale of database applications are expanding. Most traditional information systems are query-driven: the database serves as a store of historical knowledge, which is effective for ordinary queries. When the volume of data and the size of the database increase sharply, however, the query and retrieval mechanisms and statistical analysis methods of traditional database management systems can no longer meet real needs, and automatically, intelligently, and quickly extracting useful information and knowledge from the database becomes an urgent requirement. In general, data mining tasks can be divided into two categories: descriptive data mining and predictive data mining. Data mining has a wide range of applications in financial data analysis, the study of gene sequence composition, retail data analysis, telecommunications, and other areas. Wherever there is data, there is data mining.

2. Introduce and analyze the theory and methods related to clustering problems in data mining. The clustering problem is to identify the classes implicit in the data, where a class is a set of data with similar properties. Different measures of similarity lead to different clustering methods; for example, similarity can be described by distance. Generally, the way similarity is described is specified by the user or an expert. 
A good clustering method produces good clusters, with low similarity between classes and high similarity within each class. Clustering algorithms can be divided into two major categories: hierarchical methods and partitioning methods. This thesis introduces hierarchical algorithms and describes partitioning algorithms, among which it highlights the K-means clustering algorithm and briefly works through an example of how the algorithm is solved. Finally, a comparison of the relevant algorithms shows that the K-means clustering algorithm has the lowest space complexity and time complexity among them.

3. Introduce and analyze semi-supervised learning methods. In traditional supervised learning, the learner is trained on a large amount of labeled data to build a model that predicts the labels of unlabeled data. Labeled data, however, is often difficult, expensive, and time-consuming to obtain, and labeling often requires experienced researchers. With the rapid development of data collection and storage technologies, unlabeled data is very easy to collect, but clustering with unlabeled data alone can produce very large errors. Conversely, using only a small amount of "expensive" labeled data while ignoring the large amount of "cheap" unlabeled data is a great waste of data resources. Semi-supervised learning is a way of learning from a large amount of unlabeled data together with a small amount of labeled data. By combining the two, it avoids this waste of data resources and is of great significance in both theoretical research and practical applications. This thesis presents five learning methods for semi-supervised classification and gives a diagrammatic description of semi-supervised clustering.

4. 
Study the K-means clustering algorithm and propose two improved algorithms, a semi-supervised algorithm, and two improved semi-supervised algorithms. The K-means algorithm determines k cluster centers by averaging. Its idea is that once a class is determined, the class center is the geometric center, that is, the mean, of the data points in the class. In the K-means clustering algorithm the initial cluster centers are selected at random. Random selection can make the algorithm less efficient: it needs more iterations and a longer CPU running time. To address this, we propose an improved method for selecting the initial points, called the improved K-means clustering algorithm. The algorithm uses data segmentation: the sample points of the data set are divided into k segments, and a center within each segment is taken as an initial center. This approach avoids choosing initial centers that are too close together. The experiments in this thesis show that the algorithm is effective. Combining this with the idea of semi-supervised learning, the thesis presents a semi-supervised K-means clustering algorithm, in which the method for choosing the initial cluster centers is extended to make use of semi-supervised learning. In the semi-supervised K-means clustering algorithm the choice of labeled data is very important and has a significant influence on the clustering results. The algorithm is applied to the clustering of two-dimensional data, which verifies its effectiveness.

The results of this research enrich the theoretical and applied study of clustering problems in data mining. The research on cluster analysis, K-means clustering, and semi-supervised K-means clustering has both theoretical and practical value.
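
The two initialization ideas described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's actual algorithm: the abstract does not say how the sample points are ordered before segmentation or how the labeled data enter the choice of centers. The sketch assumes that points are sorted by their coordinates and cut into k contiguous segments whose means serve as initial centers, and that in the semi-supervised case each initial center is the mean of the labeled points of one class. The function names `segment_init`, `seeded_init`, and `kmeans` are hypothetical.

```python
import math


def mean(points):
    """Component-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def segment_init(points, k):
    """Initial centers by data segmentation (assumed reading of the abstract).

    Assumption: the points are sorted by their coordinates, cut into k
    contiguous segments, and the mean of each segment is used as an
    initial center. Requires len(points) >= k.
    """
    ordered = sorted(points)
    n = len(ordered)
    bounds = [i * n // k for i in range(k + 1)]
    return [mean(ordered[bounds[i]:bounds[i + 1]]) for i in range(k)]


def seeded_init(labeled):
    """Initial centers from labeled data (assumed semi-supervised variant).

    `labeled` is a list of (point, label) pairs; each label yields one
    center, the mean of the points carrying that label.
    """
    groups = {}
    for point, label in labeled:
        groups.setdefault(label, []).append(point)
    return [mean(groups[label]) for label in sorted(groups)]


def kmeans(points, centers, max_iter=100):
    """Standard K-means iteration starting from the given centers.

    Returns the final centers, the clusters from the last assignment
    step, and the number of iterations performed.
    """
    centers = list(centers)
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        updated = [mean(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, clusters, iteration


if __name__ == "__main__":
    # Made-up two-dimensional sample points, for illustration only.
    data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
            (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    print(kmeans(data, segment_init(data, 2)))
    print(kmeans(data, seeded_init([((1.0, 1.0), "a"), ((8.0, 8.0), "b")])))
```

Both initializers only choose the starting centers; the iteration itself is unchanged, which matches the abstract's description of the improvement as a change to initial point selection. Starting centers that are already spread across the data can reduce the number of iterations needed compared with random selection.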
Keywords/Search Tags: Data Mining, Semi-supervised Learning, K-means Clustering