Study Of Duplicate Subject Detection Based On Text Classification And Similarity

Posted on:2009-05-11

Degree:Master

Type:Thesis

Country:China

Candidate:T Liang

GTID:2208360242997471

Subject:Computer software theory

Abstract/Summary:

English is becoming increasingly important, and more and more kinds of English tests are being developed. At the same time, the spread of computers and the growth of the Internet have made online English study and testing inevitable. Collecting subjects (test questions) from many experts is an effective way to build a subject database quickly, but it easily introduces similar and duplicate subjects, which affect test results. For this reason, a technique for detecting duplicate subjects is needed.

This thesis argues that "similar subjects" and "duplicate subjects" should be distinguished. "Similar subjects" serve a specific purpose in setting questions for a test or examination, and may be entered into the database once they have been marked. "Duplicate subjects" are useless to the database and must be rejected. We therefore use text classification and similarity computation to check for both "similar subjects" and "duplicate subjects".

First, suspicious subjects are sieved out from the database. Text classification is then used to compute the degree of similarity of the backbone information, so that "similar subjects" can be identified and marked. To this end, texts are represented with the Vector Space Model (VSM), feature weights are computed with TF-IDF, features are selected by information gain, and the Naive Bayes algorithm performs the classification.

We also introduce a new method for computing text similarity: an algorithm based on Hamming distance. Building on the theory of Hamming distance, we construct a new formula for the similarity between different texts and queries, and compare the new method with others; it has some advantages over them.

In the experiments, the accuracy of both finding "similar subjects" and checking "duplicate subjects" is above 90 percent, and the speed is also high enough. In brief, the experimental results are satisfactory.
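
The abstract names TF-IDF weighting in a Vector Space Model and a Hamming-distance-based similarity, but does not give the formulas. The following is only a minimal Python sketch of those two ideas under stated assumptions: the whitespace tokenizer, the function names, the sample questions, and in particular the formula similarity = 1 - distance / vocabulary size are illustrative choices, not the thesis's own method.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real preprocessing."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF weight vectors) for a list of tokenized docs."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if doc else 0.0
            for t in vocab
        ])
    return vocab, vectors


def hamming_similarity(doc_a, doc_b):
    """Similarity in [0, 1] from the Hamming distance of binary term vectors.

    ASSUMPTION: similarity = 1 - distance / vocabulary size, computed over
    the union of both documents' terms. The thesis's formula is not given
    in the abstract and may differ.
    """
    set_a, set_b = set(doc_a), set(doc_b)
    vocab = set_a | set_b
    if not vocab:
        return 1.0
    # Hamming distance: positions where exactly one document has the term.
    distance = sum((t in set_a) != (t in set_b) for t in vocab)
    return 1.0 - distance / len(vocab)


# Hypothetical test questions, used only for illustration.
questions = [
    "she has lived here since 2001",
    "she has lived here since 2001",
    "he has worked here since 2001",
]
docs = [tokenize(q) for q in questions]
vocab, weights = tfidf_vectors(docs)
sim_duplicate = hamming_similarity(docs[0], docs[1])
sim_similar = hamming_similarity(docs[0], docs[2])
```

With these sample questions, the identical pair scores 1.0 and the reworded pair scores 0.5, so a threshold between the two could separate "duplicate" from "similar" candidates. Over the union vocabulary this particular formula coincides with the Jaccard index; a fixed global vocabulary would give different values.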

Keywords/Search Tags:

text classification, similarity, duplicate subjects checking, VSM

Related items

1	Research And System Development Of Content Duplicate Checking In E-business Website Based On Semantics
2	Research And Implementation Of Duplicate Checking System Under Internet Environment
3	Research And Implementation Of Text Duplication Check With Fuzzy Matching Algorithm In Cloud Computing Environment
4	Design And Implementation Of An Automatic Collection And Classification System For Web Text
5	Research And Design Of New Media Manuscript Duplicate Checking System Based On Spring Boot
6	A Detection Method Of Duplicate Defect Reports Based On Fusing Text And Categorization Information
7	Study On Chinese Text Classification Technology Based On Improved Text Similarity Algorithm
8	Research On PU Text Classification Based On Similarity Method
9	The Study On Duplicate Checking Of Python Programming Assignments
10	Improving memory for maps and text through dimension, color, and serial order