
Research Of Improvement To The Density-based Method For Reducing The Amount Of Training Data And Application To KNN

Posted on: 2011-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Yang
Full Text: PDF
GTID: 2178360308958255
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of information technology and the popularization of the Internet, large volumes of information can be acquired conveniently and quickly. However, finding the right information quickly and accurately in this vast ocean of data has become a practical problem that people must face, and there is an urgent need to manage massive amounts of information in a well-organized way so that it can be used efficiently. At present, most information exists as text, so efficient and reasonable classification is necessary for its effective use. Text classification has therefore become a key technology for processing vast amounts of text and has gradually grown into an important research branch in the field of data mining.

This thesis studies text classification and its related technologies. It first surveys the general development of automated text categorization, covering text preprocessing, text representation, feature selection, feature weighting, kNN (k-nearest neighbor), the density-based method for reducing the amount of training data, and classification performance evaluation. It then focuses on the analysis of the kNN algorithm and the density-based reduction method. Our primary contributions are as follows.

First, we propose an improvement to the density-based method for reducing the amount of training data in kNN. The density of the training data directly affects both the efficiency and the precision of a kNN text classifier.
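The abstract builds on the standard kNN text-classification step: represent texts as weighted term vectors, find the k training texts most similar to a new text, and take a majority vote over their labels. Below is a minimal sketch of that baseline, assuming texts have already been converted to sparse term-weight dictionaries; the function names and vector representation are illustrative, not from the thesis.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two sparse vectors (dict: term -> weight)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, training, k):
    # training: list of (vector, label) pairs;
    # vote among the k training texts most similar to the query
    neighbors = sorted(training, key=lambda tl: cosine(query, tl[0]),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Because every classification scans the whole training set, reducing the amount of training data (as the density-based method does) directly cuts classification time, which is why the thesis focuses on that reduction step.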
Analysis of the density-based reduction method in the kNN text classifier reveals two disadvantages. First, the reduced training data do not reach the ideal state of uniform density, in which every two neighboring training texts would be equally spaced. Second, low-density training texts receive no treatment at all, so large numbers of low-density texts remain in the reduced training data. To address these deficiencies, an improved approach is proposed: the reduction strategy is optimized, and a method for supplementing the training data with appropriate texts is introduced. Experiments show that the improved method performs distinctly better in both algorithmic stability and accuracy.

Second, an improved kNN algorithm is proposed. The original kNN algorithm has no proper way to determine the optimal value of k: an initial value, typically between a few hundred and several thousand, is set and then adjusted according to experimental results. This is not a practical way to promote the kNN algorithm in real applications. To address this deficiency, an improvement based on the density-based reduction method is proposed. Briefly, the improved algorithm finds the nearest neighbors that lie within the ε-neighborhood of the new text and then classifies the new text based on those neighbors. Results show that the improved algorithm better solves the problem of determining k in kNN while achieving superior time efficiency; classification effectiveness remains largely the same.
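The abstract does not spell out the ε-neighborhood variant in detail, but the idea as stated is that k is no longer fixed in advance: the neighbors are whatever training texts fall within distance ε of the new text. A minimal sketch of that idea follows; the function names, the pluggable distance function, and the fallback for an empty neighborhood are all illustrative assumptions, not the thesis's exact algorithm.

```python
from collections import Counter

def eps_knn_classify(query, training, eps, distance):
    """Classify `query` by voting among all training texts whose
    distance to it is at most eps (i.e. its eps-neighborhood),
    so the neighbor count k is determined by the data, not preset."""
    scored = [(distance(query, vec), label) for vec, label in training]
    in_ball = [label for d, label in scored if d <= eps]
    if not in_ball:
        # fall back to the single nearest text when the ball is empty
        in_ball = [min(scored)[1]]
    return Counter(in_ball).most_common(1)[0][0]
```

In this form, ε plays the role that k played before: a dense region contributes many voters and a sparse region few, which is why the method pairs naturally with the density-based reduction step that evens out the training data's density.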
Keywords/Search Tags:text classification, kNN, fast classification, reducing training data, supplementing training data