
Analysis Of The Clustering Technology

Posted on: 2015-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Liu
Full Text: PDF
GTID: 2268330431457574
Subject: Computer application technology
Abstract/Summary:
Clustering, which draws on mathematics and statistics, is applied in fields such as computer science, biology, and economics. Cluster analysis is a classical data mining method that groups data into clusters according to their characteristics and a similarity standard. The method first analyzes the characteristics of the data to be grouped, then establishes a similarity standard, and finally applies a clustering algorithm based on the similarity among data objects, thereby achieving classification. Although clustering has been applied successfully to text classification, Web page classification, and Web user classification, different algorithms perform differently depending on the application context. This thesis investigates the application of cluster analysis to multi-database classification and text classification, based on the internal relations between objects.

With advances in information technology and the growth of their size and number of branches, corporations, especially multinational ones, have built more and more transaction databases, which together are called a multi-database. In a multi-database environment, both the number of databases and the quantity of data in each database are very large. Traditional single-database mining techniques therefore cannot meet the needs of multi-database mining. It has been shown that the most effective strategy for mining a multi-database is to classify the databases before mining, so multi-database classification has become a key problem for cluster analysis that needs prompt resolution. Based on the data characteristics of multi-databases and on existing research, this thesis proposes a new standard for judging the similarity between data objects and designs corresponding clustering algorithms.

Text is widely used as an information carrier.
Text information processing is an interdisciplinary field that covers statistics, machine learning, pattern recognition, and data mining. Because the quantity of text information is large, the most effective way to mine text is to classify it in advance and then mine a model for each class. Text classification has thus become an important area of text information processing. A single text can be analyzed as a collection of words, just as a single transaction database can be treated as a set of transactions, so text data and database data are inherently connected. This thesis therefore puts forward a new text clustering algorithm that borrows the strategy used for clustering multi-databases.

In the course of the research, we first study the technical foundations of cluster analysis and the theory of multi-database mining and text mining. We then construct a new evaluation standard according to the characteristics of multi-database objects and carry it over by analogy to text categorization. Finally, we design corresponding algorithms for clustering multi-databases and texts. The main contents of the research are as follows:

(1) Propose an improved multi-database clustering method based on an existing algorithm.
Some algorithms for clustering multi-databases achieve good results, but they can miss the best classification during the selection process. To address this, we propose an improved method based on an existing algorithm and validate it on an artificial data set.
The experiments show that the new algorithm obtains better classification results in some cases, but its time complexity is relatively high, so it is suited to application environments that require higher classification accuracy.

(2) Design a new multi-database clustering method based on an improved PAntSC* algorithm.
The PAntSC* algorithm has been applied to text classification, but it requires the number of categories to be specified in advance. In this thesis, we put forward an improved method based on the PAntSC* algorithm and apply it to multi-database classification. First, we establish the database sequence L according to the Silhouette Coefficient of each database; then we gather the databases into appropriate categories using the improved PAntSC* algorithm. Finally, we determine the optimal classification according to the result evaluation criteria. In theory, our method avoids the limitation of the traditional PAntSC* algorithm, which needs the number of categories to be specified in advance. The feasibility and effectiveness of the algorithm are verified through experiments and practice.

(3) Propose a text clustering algorithm based on a Huffman tree.
A text is a set of sentences, and each sentence is composed of words. A transaction database is a collection of records, and each record consists of transaction items. There are therefore inherent associations between the data in text objects and in multi-database objects. According to the characteristics of text data, we adopt the clustering strategy from the multi-database classification research, propose a text clustering algorithm based on the Huffman tree structure, and select the best clustering result according to the evaluation standard.
Experiments are carried out on a Chinese classification corpus. The results are not optimal, but they indicate that the algorithm is feasible and effective.

In this thesis, we study the application of cluster analysis to multi-database classification and text classification. Three clustering algorithms are proposed, and experiments are carried out to demonstrate their feasibility and effectiveness. In theory, this research lays a solid foundation for clustering technology; in application, it proposes new clustering methods for multi-database classification and text classification.
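
To make the Silhouette Coefficient mentioned in (2) concrete, the following minimal Python sketch computes it for a small clustering of transaction databases. The abstract does not state the thesis's similarity standard, so the Jaccard distance between item sets, the function names, and the toy databases are all assumptions introduced here for illustration. The sketch shows only the standard Silhouette Coefficient, not the improved PAntSC* algorithm.

```python
from typing import Dict, FrozenSet, List


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Assumed distance between two item sets: 1 - |intersection| / |union|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def silhouette(items: Dict[str, FrozenSet[str]],
               labels: Dict[str, int]) -> Dict[str, float]:
    """Return the silhouette coefficient s = (b - a) / max(a, b) per object."""
    # Group object names by cluster label.
    clusters: Dict[int, List[str]] = {}
    for name, label in labels.items():
        clusters.setdefault(label, []).append(name)

    scores: Dict[str, float] = {}
    for name, label in labels.items():
        own = [other for other in clusters[label] if other != name]
        if not own or len(clusters) < 2:
            # Common convention: singleton clusters (or a single cluster) score 0.
            scores[name] = 0.0
            continue
        # a: mean distance to the other members of the object's own cluster.
        a = sum(jaccard_distance(items[name], items[o]) for o in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(jaccard_distance(items[name], items[o]) for o in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores[name] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return scores


def mean_silhouette(scores: Dict[str, float]) -> float:
    """Average silhouette over all objects; higher suggests a better clustering."""
    return sum(scores.values()) / len(scores)


# Hypothetical toy multi-database: each database is reduced to its set of items.
databases = {
    "D1": frozenset({"bread", "milk"}),
    "D2": frozenset({"bread", "milk", "eggs"}),
    "D3": frozenset({"nails", "hammer"}),
    "D4": frozenset({"nails", "hammer", "saw"}),
}

good = silhouette(databases, {"D1": 0, "D2": 0, "D3": 1, "D4": 1})
bad = silhouette(databases, {"D1": 0, "D3": 0, "D2": 1, "D4": 1})
print(mean_silhouette(good), mean_silhouette(bad))
```

Under these assumptions, the grouping that keeps similar databases together receives the higher mean silhouette, which is the sense in which the coefficient can rank candidate classifications without fixing the number of categories beforehand.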
Keywords/Search Tags: Data Mining, Cluster Analysis, Multi-database Clustering, Text Clustering, PAntSC*, Huffman Tree