
Fast KNN Text Categorization Method Based On Improved Hash Algorithm

Posted on: 2013-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: Q S Xia
Full Text: PDF
GTID: 2248330371999438
Subject: Computer software and theory
Abstract/Summary:
As networks become more widespread and people grow increasingly dependent on technology, more and more data is stored on computers in electronic form. In today's fast-paced society, quickly and efficiently finding the needed data, whether in large enterprise data stores or on the network, has become an important problem. Researchers in China and abroad have therefore proposed a variety of techniques, such as database technology, keyword matching, and text classification. Text classification can reduce the time spent searching for content of interest and, to a certain extent, improve both the accuracy of search results and the user experience. Commonly used text classification techniques, such as Bayesian classification, support vector machines, and decision trees, require a lot of time to train the classifier, and the classifier must be retrained whenever the training texts are updated. One of the big advantages of the traditional KNN classifier is that it does not have to be retrained when training texts are added. Its classification accuracy is also relatively high, so it has been very popular. However, the KNN algorithm has a bottleneck: it must compute the similarity between the text to be classified and every text in the training set, which takes a lot of time. This thesis proposes an improved algorithm: it builds lists of texts keyed on selected features, compares those features with the features of the text to be classified, and, based on the result, hashes to the subset of texts most likely to be needed. This greatly improves the speed of text retrieval. The algorithm also adjusts the similarity between the texts of a class and the text to be classified according to an overflow rate, defined as the quotient of the distance to the class and the text to be classified, which noticeably improves classification accuracy.
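The candidate-reduction idea described above can be illustrated with a minimal sketch. It is not the thesis's actual algorithm: the choice of bucket key (each text's `m` highest-weighted terms), the class name `BucketedKNN`, the similarity-weighted vote, and all parameter values are assumptions made for illustration, and the overflow-rate adjustment is left out because the abstract does not specify it precisely.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_terms(vec, m):
    """Return the m highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:m]]


class BucketedKNN:
    """KNN that compares a query only against texts sharing a key feature.

    Hypothetical illustration: a dict (hash table) maps each key term to
    the ids of the training texts that have it among their top-m terms.
    """

    def __init__(self, k=3, m=2):
        self.k, self.m = k, m
        self.docs = []                    # list of (vector, label)
        self.buckets = defaultdict(set)   # term -> ids of training texts

    def add(self, vec, label):
        """Adding a text only updates the index; nothing is retrained."""
        doc_id = len(self.docs)
        self.docs.append((vec, label))
        for term in top_terms(vec, self.m):
            self.buckets[term].add(doc_id)

    def candidates(self, vec):
        """Ids of training texts that share a key term with the query."""
        ids = set()
        for term in top_terms(vec, self.m):
            ids |= self.buckets.get(term, set())
        return ids

    def classify(self, vec):
        """Similarity-weighted vote among the k nearest candidates."""
        ids = self.candidates(vec)
        if not ids:
            return None  # no bucket matched; a full scan could be a fallback
        scored = sorted(
            ((cosine(vec, self.docs[i][0]), i) for i in ids), reverse=True
        )[: self.k]
        votes = Counter()
        for sim, i in scored:
            votes[self.docs[i][1]] += sim
        return votes.most_common(1)[0][0]
```

The trade-off in a scheme like this is that a query is never compared against texts outside its buckets, so a larger `m` recovers more true neighbours at the cost of more similarity computations.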
Text features are selected using an improved version of the traditional tf-idf algorithm, and feature weights are adjusted according to part of speech, sentence composition, the article's title and summary, the location of the passage, the position of the sentence, and prompt words in the sentence. The experimental results indicate that this approach effectively improves the accuracy of text classification.
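As a rough illustration of position-aware weighting, the sketch below boosts the term frequency of words that appear in a text's title before applying idf. The abstract does not give the thesis's actual formula, so everything here is an assumption: the function name `weighted_tfidf`, the `title_boost` value, the document layout (`"title"` and `"body"` token lists), and the smoothed idf variant. The other signals mentioned above (part of speech, sentence position, prompt words) are not modelled.

```python
import math
from collections import Counter


def weighted_tfidf(docs, title_boost=2.0):
    """Compute tf-idf vectors with an extra weight for title terms.

    docs: list of {"title": [tokens], "body": [tokens]} (assumed layout).
    title_boost: amount added to a term's count per title occurrence
                 (an illustrative value, not taken from the thesis).
    """
    n = len(docs)

    # Document frequency: number of texts in which each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d["title"]) | set(d["body"]))

    vectors = []
    for d in docs:
        tf = Counter(d["body"])
        for term in d["title"]:
            tf[term] += title_boost  # title terms count for more
        vectors.append({
            # Smoothed idf, a common variant: log((1 + n) / (1 + df)) + 1
            term: count * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors
```

Vectors produced this way are sparse term-weight dicts, so they can be fed directly to a cosine-similarity KNN classifier.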
Keywords/Search Tags:text classification, KNN, weighted feature, part-of-speech tagging, tipwords