
Research On Algorithms For Machine Learning And Text Mining

Posted on: 2003-09-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q He
Full Text: PDF
GTID: 1118360185995730
Subject: Computer software and theory
Abstract/Summary:
This dissertation studies several algorithms for machine learning and text mining. Classifying massive data sets with a Support Vector Machine (SVM) is difficult. To address this problem, the first part proposes a new universal classification method based on hypersurfaces. The theoretical basis of the method is the Jordan Curve Theorem. The contributions of the first part are as follows:
1) The existence of a separating hypersurface and its geometric construction are studied, and the Classification method based on Geometric HyperSurface, abbreviated GHSC, is put forward. The main characteristics of GHSC are: i) It solves nonlinear classification problems directly; it needs neither a kernel function nor a mapping from a lower-dimensional space to a higher-dimensional space. ii) It is a universal and operable method for constructing a separating hypersurface. iii) It is an interesting, convenient and manageable classification method: it classifies a sample according to whether its winding number is odd or even, so classifying data with a non-convex hypersurface remains convenient and manageable. iv) It is suitable for classifying massive data and is expected to handle high-dimensional data problems.
2) Using GHSC, programs for classifying data in 2-dimensional and 3-dimensional space are designed. Experimental results on typical nonlinear data discrimination show that the separating hypersurface method can effectively classify a vast amount of data (10^7). Moreover, GHSC can classify data distributed in very complex regions. The results indicate that the method improves classification efficiency and accuracy.
3) We explore the generalization of GHSC to multi-class classification problems.
4) For high-dimensional data, we adopt algebraic hypersurfaces for classification. An adaptive algorithm for choosing the order of the algebraic hypersurface is put forward to avoid complex computation.
In the second part of the dissertation, to meet the needs of large-scale text mining, several text mining techniques are studied: text information extraction, text clustering, multi-text summarization, concept and semantic spaces, and semantic indexing and retrieval. The main content is as follows:
1) An HMM (Hidden Markov Model) for concrete BibTeX entries is built, and this model is extended to an open data set. We then optimize the model by introducing smoothing techniques and extraction rules to improve the accuracy of information extraction. Experiments show that both smoothing techniques and extraction rules are effective optimizations that improve the accuracy of information extraction.
2) We select the SOM (self-organizing maps) and fuzzy clustering for the...
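The odd/even rule mentioned for GHSC in the first part can be illustrated in two dimensions with the standard even-odd ray test that follows from the Jordan Curve Theorem. The sketch below is not the author's program: the polygonal boundary `L_SHAPE` and the function names `crossing_count` and `classify` are hypothetical, and a closed polygon stands in for the separating hypersurface.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def crossing_count(sample: Point, boundary: List[Point]) -> int:
    """Count how often a rightward horizontal ray from `sample`
    crosses the closed polygon `boundary` (hypothetical helper)."""
    x, y = sample
    count = 0
    n = len(boundary)
    for i in range(n):
        (x1, y1), (x2, y2) = boundary[i], boundary[(i + 1) % n]
        # The edge straddles the horizontal line through the sample.
        # The strict/non-strict comparison avoids counting a vertex twice.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                count += 1
    return count


def classify(sample: Point, boundary: List[Point]) -> str:
    """Odd crossing count: inside the closed curve; even: outside."""
    return "inside" if crossing_count(sample, boundary) % 2 == 1 else "outside"


# A non-convex (L-shaped) boundary, chosen only for illustration.
L_SHAPE: List[Point] = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]

if __name__ == "__main__":
    for p in [(1, 1), (3, 3), (1, 3), (5, 1)]:
        print(p, classify(p, L_SHAPE))
```

The test needs no kernel function and no convexity assumption, which is the property the abstract attributes to GHSC; how the dissertation actually constructs and stores the hypersurface is not stated here.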
Keywords/Search Tags: Machine Learning, the classification method based on hypersurface, text clustering, Hidden Markov Model, Information Extraction, Self-Organizing Maps (SOM), multi-abstract, concept semantic space, fuzzy direct cluster, semantic index