The online test system for the C programming language draws its papers from an examination question database. Because the original system lacks a duplicate-checking module, similar questions are hard to keep out of the database, which lowers the quality of test papers and the effectiveness of examinations. This thesis therefore addresses how to find such similar test questions quickly and accurately.

Duplicate checking of C test questions is a similarity calculation problem in natural language processing (NLP). After surveying the extensive research on similarity calculation, this thesis solves the problem in three steps: word segmentation, word similarity calculation, and sentence similarity calculation.

For segmentation, this thesis chooses the ICTCLAS tool, which is practical and reliable and makes it easy to extend the original dictionary and part-of-speech tag set. For word similarity calculation, this thesis first studies several knowledge systems, such as the "Chinese Thesaurus", "HowNet", and domain ontologies. It then constructs a domain ontology for the C programming language. Finally, the domain ontology and HowNet are used to calculate the similarity of concepts. For sentence similarity calculation, many methods based on word sense, word order, and syntactic features have been proposed in the Chinese-language literature. Because similar C test questions differ in only a few words and keep a fixed word order, this thesis selects the Levenshtein distance algorithm to calculate sentence similarity.

In summary, ICTCLAS is first used to segment the text and tag parts of speech. Next, the C domain ontology is used to calculate the similarity of domain words. Finally, the Levenshtein distance algorithm is used to calculate sentence similarity, with edit operation costs that vary with the part of speech of the words involved. Experiments show that these methods identify similar C test questions effectively and accurately, so the problem is largely solved.
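The final step, a Levenshtein distance whose operation costs depend on part of speech, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the tag names, the per-tag costs in `POS_COST`, the substitution formula, and the normalization are all assumptions chosen for illustration, and the `word_sim` parameter is a hypothetical stand-in for the ontology-based or HowNet-based word similarity. Input sentences are assumed to be already segmented and tagged.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical per-tag edit costs (assumption): editing a content word
# (noun, verb) costs more than editing a function word or punctuation.
POS_COST: Dict[str, float] = {
    "n": 1.0, "v": 1.0, "a": 0.6, "d": 0.4, "u": 0.2, "w": 0.1,
}
DEFAULT_COST = 0.5  # used for any tag not listed above


def pos_cost(token: Token) -> float:
    """Cost of inserting or deleting this token, looked up by its tag."""
    return POS_COST.get(token[1], DEFAULT_COST)


def exact_match(a: str, b: str) -> float:
    """Placeholder word similarity: 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0


def weighted_distance(
    s: List[Token],
    t: List[Token],
    word_sim: Callable[[str, str], float] = exact_match,
) -> float:
    """Levenshtein distance over tokens with part-of-speech-dependent costs."""
    # prev[j] is the cost of turning an empty prefix of s into t[:j].
    prev = [0.0]
    for tok in t:
        prev.append(prev[-1] + pos_cost(tok))
    for a in s:
        cur = [prev[0] + pos_cost(a)]
        for j, b in enumerate(t, start=1):
            # Substitution is discounted by how similar the two words are.
            sub = max(pos_cost(a), pos_cost(b)) * (1.0 - word_sim(a[0], b[0]))
            cur.append(min(
                prev[j] + pos_cost(a),     # delete a
                cur[j - 1] + pos_cost(b),  # insert b
                prev[j - 1] + sub,         # substitute a with b
            ))
        prev = cur
    return prev[-1]


def sentence_similarity(
    s: List[Token],
    t: List[Token],
    word_sim: Callable[[str, str], float] = exact_match,
) -> float:
    """Map the distance into [0, 1]; identical sentences score 1.0."""
    # Deleting all of s and inserting all of t is an upper bound on the distance.
    bound = sum(pos_cost(x) for x in s) + sum(pos_cost(x) for x in t)
    if bound == 0:
        return 1.0
    return 1.0 - weighted_distance(s, t, word_sim) / bound
```

Under these assumed costs, two questions that differ only in an auxiliary word end up closer together than two questions that differ in a noun or verb. Passing a `word_sim` that rates domain synonyms highly reduces the substitution cost between them, so reworded duplicates score closer to 1.0.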