
A Study Of Some Crucial Algorithms For Text Mining

Posted on: 2011-11-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J D Tan
Full Text: PDF
GTID: 1118330335962561
Subject: Department of Automation
Abstract/Summary:
Text mining is a very active field of study and an important offshoot of data mining. It makes full use of traditional data mining techniques, but also needs special methods suited to the characteristics of text. We apply support vector machines, manifold learning and graph theory to design practical text mining algorithms for classification, clustering, compression, visualization and ranking. The main contributions of this thesis are as follows:

1. Based on the proof of a series of theorems, we present a new continued fraction Mercer kernel, which can be used in the SVC algorithm and other SVM algorithms. Experimental results show that SVC with the continued fraction kernel works successfully on real data and is competitive with other existing simple kernels. Moreover, this kernel can easily be combined with relatively complex kernels such as RBF through kernel tricks.

2. We propose two novel methods for dimensionality reduction: modified PCA and modified Kernel PCA. Based on the theory of PCA and the Maximizing Margin Criterion, we construct a multi-objective model to formalize our goals for dimensionality reduction. This model is then transformed into a single-objective cost function for the projection, and the optimal linear mapping is obtained by optimizing this cost function. Additionally, we divide the nearly block-diagonal kernel matrix into c kernel matrices and use eigendecomposition to solve for their d principal vectors, from which d approximate eigenvectors of the original kernel matrix K are obtained. The combined mapping V can then be used for dimensionality reduction; it preserves within-class information efficiently and can handle larger datasets than kernel PCA. Finally, the two methods are applied to compress several datasets, and the results show their validity.

3. We propose a novel approach to learning preference relations using Support Vector Regression (SVR). It overcomes the problem of inconsistent preferences and, owing to the properties of the SVR method, improves generalization in ranking. The WMW statistic is introduced to evaluate the results of the ranking algorithm. Experiments on an artificial dataset and several benchmark datasets show the effectiveness of the proposed algorithm. An application of the proposed method to ranking in a web search system is also demonstrated.

4. Shared nearest neighbor (SNN) similarity is a new similarity measure that overcomes two difficulties: low similarity between samples and differing class densities. At present there are two popular SNN similarity based clustering methods: JP clustering and SNN density based clustering. Their clustering results rely heavily on the weight of a single edge, which makes them fragile. Motivated by the idea of smooth merging in computational geometry, we design a novel SNN density based clustering algorithm. Because it inherits the complementary intensity-smoothness principle, its generalization ability surpasses that of the other two methods. Experimental results on a public text dataset also show that our method achieves the best clustering precision and recall in most cases.

5. The Internet is open, hierarchical, evolving and massive, and is a typical complex adaptive system. We therefore propose a new complex adaptive search model based on the theory of complex adaptive systems. By uniting the main functions of information collection, classification, cleaning and services, a multi-agent experimental environment is formed. The learning and evolutionary mechanisms are also studied, so that a search engine built on the new model can actively adapt to the complex and dynamic network environment. This model can also be widely used to construct special-purpose search models.
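As an illustration of how a WMW (Wilcoxon-Mann-Whitney) statistic can score a ranking, the sketch below computes its common two-class form: the fraction of (relevant, non-relevant) pairs that the ranker orders correctly, with ties counted as half. The abstract does not give the exact formula used in the thesis, so the function name `wmw_statistic`, the two-class setting and the tie handling are assumptions made for this example only.

```python
from typing import Sequence


def wmw_statistic(scores: Sequence[float], relevant: Sequence[bool]) -> float:
    """Two-class WMW statistic (assumed form, not taken from the thesis).

    Returns the fraction of (relevant, non-relevant) pairs in which the
    relevant item received the higher score; tied pairs count as 0.5.
    """
    pos = [s for s, r in zip(scores, relevant) if r]
    neg = [s for s, r in zip(scores, relevant) if not r]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one non-relevant item")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # pair ordered correctly
            elif p == n:
                wins += 0.5   # tie counts as half a correct pair
    return wins / (len(pos) * len(neg))


# Hypothetical scores from a ranker and their relevance labels.
example = wmw_statistic([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
```

A value of 1.0 means every relevant item outranks every non-relevant one, and 0.5 corresponds to a random ordering. In this two-class form the statistic coincides with the area under the ROC curve.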
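The SNN similarity underlying JP clustering and SNN density based clustering can be sketched as follows: two samples are similar to the extent that their k-nearest-neighbor lists overlap. This is only the basic similarity measure, not the clustering algorithm proposed in the thesis. The function names, the Euclidean distance and the plain overlap count (without the mutual-neighbor condition that some formulations add) are assumptions for illustration.

```python
import math
from typing import List, Sequence, Set


def knn_sets(points: Sequence[Sequence[float]], k: int) -> List[Set[int]]:
    """Indices of the k nearest neighbors of each point (Euclidean distance)."""
    neighbours = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def snn_similarity(points: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """SNN similarity matrix: number of neighbors shared by each pair of points."""
    nn = knn_sets(points, k)
    n = len(points)
    return [[len(nn[i] & nn[j]) for j in range(n)] for i in range(n)]


# Two well-separated groups of three hypothetical points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sim = snn_similarity(pts, k=2)
```

Because the measure depends only on neighbor ranks rather than raw distances, points in a sparse cluster can be as similar to each other as points in a dense one, which is how SNN similarity handles the differing class densities mentioned in item 4.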
Keywords/Search Tags: Mercer kernel, continued fraction kernel, support vector machine, classification, PCA, Maximizing Margin Criterion, Kernel PCA, Diagonal Block Matrix, Text Visualization, Preference relations, Support vector regression, WMW, Ranking, JP Clustering