This paper studies algorithms for text categorization and text clustering. We analyse several key techniques and their problems, and propose some improvements. First, we introduce the Vector Space Model and methods for computing term weights, and compare several effective feature selection methods. We then analyse two classification algorithms in detail, SVM and KNN, which perform better than the others. Our experiments on these two methods show that KNN is more stable than SVM, so we adopt it in our real system.

Because KNN is an instance-based algorithm, its slow classification speed is a major problem. To overcome it, we propose replacing the document samples with a smaller number of semantic centers. Text clustering is used to construct the semantic centers, and we describe in detail the nearest neighbour clustering algorithm and its specific problems. Several ways of tuning parameters dynamically are used to optimize clustering quality. For the problem of choosing initial cluster centroids, we improve an existing algorithm and present the details of the corresponding algorithm flow.

Finally, we evaluate the above algorithms experimentally on several datasets of different sizes. The results show that our KNN classification algorithm based on semantic centers greatly improves classification speed while maintaining high precision.
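
The idea of classifying against semantic centers rather than against every training document can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper: the function names, the fixed similarity `threshold`, the per-class single-pass clustering, the similarity-weighted vote, and the toy vectors are all assumptions made for illustration, and the dynamic parameter tuning and improved centroid initialization mentioned above are omitted.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two dense term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_semantic_centers(docs, labels, threshold=0.8):
    """Single-pass nearest-neighbour clustering, run separately per class.

    Returns a list of (centroid, label) pairs. `threshold` is a hypothetical
    fixed cutoff, not a value taken from the paper.
    """
    clusters = defaultdict(list)  # label -> list of [sum_vector, count]
    for vec, label in zip(docs, labels):
        best, best_sim = None, -1.0
        for cluster in clusters[label]:
            centroid = [s / cluster[1] for s in cluster[0]]
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Close enough to an existing cluster: merge the document into it.
            best[0] = [s + x for s, x in zip(best[0], vec)]
            best[1] += 1
        else:
            # Otherwise the document starts a new cluster.
            clusters[label].append([list(vec), 1])
    return [([s / n for s in total], label)
            for label, group in clusters.items()
            for total, n in group]


def knn_classify(vec, centers, k=3):
    """Similarity-weighted KNN vote over the centers, not all samples."""
    ranked = sorted(centers, key=lambda c: cosine(vec, c[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for centroid, label in ranked:
        votes[label] += cosine(vec, centroid)
    return max(votes, key=votes.get)


# Toy term-weight vectors and labels, made up for illustration.
docs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
labels = ["sports", "sports", "finance", "finance"]
centers = build_semantic_centers(docs, labels)
prediction = knn_classify([0.8, 0.1, 0.1], centers, k=1)
```

In this sketch the speed-up comes from the query being compared with one centroid per cluster instead of with every training document, so the cost of classification grows with the number of centers rather than with the size of the training set.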