
Study of Duplicate Subject Detection Based on Text Classification and Similarity

Posted on: 2009-05-11
Degree: Master
Type: Thesis
Country: China
Candidate: T Liang
Full Text: PDF
GTID: 2208360242997471
Subject: Computer software theory
Abstract/Summary:
English is becoming increasingly important, and more and more kinds of English tests are being developed. At the same time, the spread of computers and the growth of the Internet have made online English study and testing inevitable. Collecting subjects (test questions) from many experts is an effective way to build a subject database quickly, but it easily introduces similar and duplicate subjects, which affects test results. For this reason, a technique for detecting duplicate subjects is needed.

This thesis argues that "similar subjects" and "duplicate subjects" should be distinguished. "Similar subjects" serve a specific purpose when setting questions for a test or examination, and may be entered into the database once they have been marked. "Duplicate subjects" are useless to the database and must be rejected. We therefore use text classification and similarity computation to check for both "similar subjects" and "duplicate subjects".

First, suspicious subjects are filtered out of the database. Text classification is then applied to compute the degree of similarity of the backbone information, so that "similar subjects" can be identified and marked. To this end, texts are represented with the Vector Space Model (VSM), feature weights are computed with TF-IDF, and features are selected by information gain. Finally, the Naive Bayes algorithm performs the text classification.

We also introduce a new method for computing text similarity, based on Hamming distance. Building on the theory of Hamming distance, we construct a new formula for the similarity between texts and queries, and compare it with other methods; it has some advantages over them.

In the experiments, the accuracy of finding "similar subjects" and of checking "duplicate subjects" is above 90 percent in both cases, and the speed is also sufficiently high. In short, the experimental results are satisfactory.
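
The abstract does not give the thesis's actual formulas, so the following Python sketch only illustrates the general ideas it names: TF-IDF weights over a Vector Space Model, and a similarity score derived from Hamming distance. Everything here is an assumption made for illustration: the function names (`tokenize`, `tfidf_vectors`, `presence_bits`, `hamming_similarity`), the simple tokenizer, the unsmoothed TF-IDF variant, and the choice of similarity as 1 minus the normalized Hamming distance between binary term-presence vectors. It is not the formula constructed in the thesis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-letters (assumed tokenizer; no stemming or stop words).
    return [w for w in re.split(r"[^a-z]+", text.lower()) if w]


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors), one vector per document.

    Uses tf = count / document length and idf = log(N / df), without smoothing.
    """
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for t in tokens for w in set(t))
    vectors = []
    for t in tokens:
        tf = Counter(t)
        length = len(t) or 1  # avoid division by zero for empty documents
        vectors.append([tf[w] / length * math.log(n_docs / df[w]) for w in vocab])
    return vocab, vectors


def presence_bits(text, vocab):
    # Binary vector: 1 if the vocabulary word occurs in the text, else 0.
    words = set(tokenize(text))
    return [1 if w in words else 0 for w in vocab]


def hamming_similarity(x, y):
    """Assumed similarity: 1 - (Hamming distance / vector length)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if not x:
        return 1.0
    distance = sum(p != q for p, q in zip(x, y))  # count of differing positions
    return 1.0 - distance / len(x)
```

Under these assumptions, a score near 1 would point to a likely "duplicate subject", while a mid-range score would point to a candidate "similar subject" to be passed on to the classifier. Where to place those thresholds is not stated in the abstract and would have to be tuned on real data.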
Keywords/Search Tags:text classification, similarity, duplicate subjects checking, VSM