
Research Of Improvement To The Density-based Method For Reducing The Amount Of Training Data And Application To KNN

Posted on: 2011-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Yang
Full Text: PDF
GTID: 2178360308958255
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of information technology and the popularization of the Internet, large volumes of information can be acquired conveniently and quickly. However, finding the right information quickly and accurately in this vast ocean of data has become a practical problem that people must face, and there is an urgent need to manage massive amounts of information in a well-organized way so that it can be used efficiently. At present, most information exists as text, so efficient and reasonable classification is necessary for its effective use. Text classification has therefore become a key technology for processing vast amounts of text and has gradually grown into an important research branch in the field of data mining.

This thesis studies text classification and its related technologies. It first surveys the general development of automated text categorization, covering text preprocessing, text representation, feature selection, feature weighting, kNN (k-nearest neighbor), the density-based method for reducing the amount of training data, and classification performance evaluation. It then focuses on the analysis of the kNN algorithm and the density-based reduction method. Our primary contributions are as follows.

First, we propose an improvement to the density-based method for reducing the amount of training data in kNN. The density of the training data directly affects both the efficiency and the precision of a kNN text classifier.
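The abstract builds on the standard kNN text-classification step: represent texts as weighted term vectors, find the k training texts most similar to a new text, and take a majority vote over their labels. Below is a minimal sketch of that baseline, assuming texts have already been converted to sparse term-weight dictionaries; the function names and vector representation are illustrative, not from the thesis.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two sparse vectors (dict: term -> weight)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, training, k):
    # training: list of (vector, label) pairs;
    # vote among the k training texts most similar to the query
    neighbors = sorted(training, key=lambda tl: cosine(query, tl[0]),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Because every classification scans the whole training set, reducing the amount of training data (as the density-based method does) directly cuts classification time, which is why the thesis focuses on that reduction step.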
Analysis of the density-based reduction method in the kNN text classifier reveals two disadvantages. First, the reduced training data do not reach the ideal state of uniform density, in which every two neighboring training texts would be equally spaced. Second, low-density training texts receive no treatment at all, so large numbers of low-density texts remain in the reduced training data. To address these deficiencies, an improved approach is proposed: the reduction strategy is optimized, and a method for supplementing the training data with appropriate texts is introduced. Experiments show that the improved method performs distinctly better in both algorithmic stability and accuracy.

Second, an improved kNN algorithm is proposed. The original kNN algorithm has no proper way to determine the optimal value of k: an initial value, typically between a few hundred and several thousand, is set and then adjusted according to experimental results. This is not a practical way to promote the kNN algorithm in real applications. To address this deficiency, an improvement based on the density-based reduction method is proposed. Briefly, the improved algorithm finds the nearest neighbors that lie within the ε-neighborhood of the new text and then classifies the new text based on those neighbors. Results show that the improved algorithm better solves the problem of determining k in kNN while achieving superior time efficiency; classification effectiveness remains largely the same.
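The abstract does not spell out the ε-neighborhood variant in detail, but the idea as stated is that k is no longer fixed in advance: the neighbors are whatever training texts fall within distance ε of the new text. A minimal sketch of that idea follows; the function names, the pluggable distance function, and the fallback for an empty neighborhood are all illustrative assumptions, not the thesis's exact algorithm.

```python
from collections import Counter

def eps_knn_classify(query, training, eps, distance):
    """Classify `query` by voting among all training texts whose
    distance to it is at most eps (i.e. its eps-neighborhood),
    so the neighbor count k is determined by the data, not preset."""
    scored = [(distance(query, vec), label) for vec, label in training]
    in_ball = [label for d, label in scored if d <= eps]
    if not in_ball:
        # fall back to the single nearest text when the ball is empty
        in_ball = [min(scored)[1]]
    return Counter(in_ball).most_common(1)[0][0]
```

In this form, ε plays the role that k played before: a dense region contributes many voters and a sparse region few, which is why the method pairs naturally with the density-based reduction step that evens out the training data's density.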
Keywords/Search Tags:text classification, kNN, fast classification, reducing training data, supplementing training data