Fast KNN Text Categorization Method Based On Improved Hash Algorithm

Posted on:2013-04-06

Degree:Master

Type:Thesis

Country:China

Candidate:Q S Xia

Full Text:PDF

GTID:2248330371999438

Subject:Computer software and theory

Abstract/Summary:

As networks become more widespread and people rely increasingly on technology, more and more data is stored in electronic form. In today's fast-paced society, quickly and efficiently finding the needed data in large enterprise data stores or on the network has become an important topic. Researchers in China and abroad have therefore proposed a variety of techniques, such as database technology, keyword matching, and text classification. Text classification can reduce the time spent searching for content of interest and, to a certain extent, improve the accuracy of search results and the user experience. Commonly used text classification techniques, such as Bayesian classification, support vector machines, and decision trees, require a lot of time to train the classifier, and the classifier must be retrained whenever the training texts are updated. A major advantage of the traditional KNN classifier is that it does not need to be retrained when training texts are added. Its classification accuracy is also relatively high, so it has been very popular. However, the KNN algorithm has a bottleneck: it must compute the similarity between the text to be classified and every text in the training set, which costs a lot of time. This thesis proposes an improved algorithm: build text lists keyed on selected features, compare those features with the features of the text to be classified, and, based on the result, hash to the subsets of training texts most likely to be needed. This greatly improves the speed of text retrieval. In addition, based on the overflow rate, which is the quotient of the distance to the class and the text to be classified, the similarity between the texts of a class and the text to be classified is adjusted, which noticeably improves classification accuracy.
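The feature-keyed lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not the thesis's implementation: the function names, the choice of each text's top-weighted terms as hash keys, the cosine similarity measure, and the similarity-weighted vote are all assumptions, and the overflow-rate adjustment is omitted because the abstract does not define it precisely.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_features(vec, n):
    """Return the n highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def build_index(train, n_keys=2):
    """Map each key feature to the ids of the training texts that carry it.

    train is a list of (term -> weight dict, label) pairs. Using the top
    n_keys terms as keys is an assumption made for this sketch.
    """
    index = defaultdict(set)
    for doc_id, (vec, _label) in enumerate(train):
        for term in top_features(vec, n_keys):
            index[term].add(doc_id)
    return index


def classify(query, train, index, k=3, n_keys=2):
    """Run KNN only on texts that share a key feature with the query."""
    candidates = set()
    for term in top_features(query, n_keys):
        candidates |= index.get(term, set())
    if not candidates:
        # No bucket matched: fall back to scanning the whole training set.
        candidates = set(range(len(train)))
    scored = sorted(
        ((cosine(query, train[i][0]), i) for i in candidates), reverse=True
    )[:k]
    votes = Counter()
    for sim, i in scored:
        votes[train[i][1]] += sim  # similarity-weighted vote
    return votes.most_common(1)[0][0], len(candidates)


train = [
    ({"goal": 3, "match": 2, "team": 1}, "sports"),
    ({"team": 3, "coach": 2, "goal": 1}, "sports"),
    ({"stock": 3, "market": 2, "bank": 1}, "finance"),
    ({"bank": 3, "loan": 2, "market": 1}, "finance"),
]
index = build_index(train)
label, n_compared = classify({"goal": 2, "team": 2, "market": 1}, train, index)
```

In this toy run the query is compared with only the two texts found in its matching buckets rather than with all four training texts, which is the source of the speed-up the abstract describes.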
Building on an improved version of the traditional tf-idf algorithm, we select text features and adjust their weights according to part of speech, sentence constituents, the article title and summary, the position of the paragraph, the position of the sentence, and sentence prompt words. The experimental results indicate that this approach effectively improves the accuracy of text classification.
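One way to picture the location-based weight adjustment is a tf-idf variant in which a term occurring in the title counts more than one occurring in the body. The sketch below is only an illustration under stated assumptions: the boost values `TITLE_BOOST` and `BODY_BOOST` are hypothetical (the abstract gives no numbers), the idf form log(N/df) is the textbook one rather than the thesis's improved variant, and the other signals mentioned (part of speech, sentence position, prompt words) are left out.

```python
import math
from collections import Counter

# Hypothetical boost factors; the source does not state actual values.
TITLE_BOOST = 3.0
BODY_BOOST = 1.0


def weighted_tf(title_tokens, body_tokens):
    """Term frequency where each title occurrence counts TITLE_BOOST times."""
    tf = Counter()
    for term in title_tokens:
        tf[term] += TITLE_BOOST
    for term in body_tokens:
        tf[term] += BODY_BOOST
    return tf


def weighted_tfidf(docs):
    """docs is a list of (title_tokens, body_tokens) pairs.

    Returns one term -> weight dict per document, using idf = log(N / df).
    """
    tfs = [weighted_tf(title, body) for title, body in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts once per term
    n = len(docs)
    return [
        {term: freq * math.log(n / df[term]) for term, freq in tf.items()}
        for tf in tfs
    ]


docs = [
    (["football"], ["match", "football", "team"]),
    (["bank"], ["loan", "team"]),
]
weights = weighted_tfidf(docs)
```

Here "football" appears once in the title and once in the body of the first text, so it ends up with four times the weight of "match", while "team" occurs in both texts and receives a weight of zero.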

Keywords/Search Tags:

text classification, KNN, weighted feature, part-of-speech tagging, tipwords

Related items

1	Research On Text Classification Method Based On Part Of Speech Tagging LDA Model
2	A Research On Lao Language Part-of-speech Tagging With Multi-feature Fusion
3	Research On Text Document Information Hiding
4	Research And Implementation On Part-Of-Speech Tagging In Automatic English Essay Scoring
5	Research And Implementation Of Modify Chinese Part-of-Speech Tagging Based On FST Technology
6	Study Of Kazak Part-of-Speech Tagging Based Upon HMM
7	Research On Lao Language Part-of-speech Tagging With Multiple Features
8	Research On Laodian Participle And Part-of-speech Tagging Method
9	Research On The Construction Method Of Burmese Part-of-speech Tagging Corpus
10	The Research Of Text Classification Technology Based On The Part Of Speech And LDA Topic Model