
Research And Implementation Of Finding Duplicate Science Project Based On Non-segmentation Techniques

Posted on: 2011-01-22  Degree: Master  Type: Thesis
Country: China  Candidate: C Zuo  Full Text: PDF
GTID: 2189360308458073  Subject: Computer software and theory
Abstract/Summary:
As investment in scientific research grows, funding bodies are increasingly concerned with how these funds are managed. To avoid funding duplicate projects and to improve efficiency, the approval process for science projects needs a way to find similar project applications (Finding Duplicate Science Project, for short). This problem can be solved by computing the similarity between science project applications, which is essentially Chinese text similarity calculation.

Traditionally, the similarity between Chinese texts is computed using words, obtained from the text by a segmentation technique, as the text's features. The result of Chinese text segmentation relies heavily on the quality of its dictionary, so it often fails to extract the domain-specific nouns that are the most important features of an application.

This thesis uses a suffix tree to find the parts shared among applications and mines frequent closed itemsets of suffix-tree nodes as features. From these it constructs a vector space model, the Frequent Closed Suffix-tree Nodeset Vector (FCSNV), to calculate the similarity between science project applications.

The thesis completes the following work: ① following the idea of the Ukkonen algorithm, it constructs the suffix tree of the collection of science project applications; ② it uses the CHARM algorithm to mine frequent closed itemsets of suffix-tree nodes; ③ it uses the frequent closed itemsets to construct a vector space model that represents each science project application, and evaluates the effectiveness of this model in computing Chinese text similarity; ④ it implements the algorithm on the .NET platform.

In summary, this thesis proposes a non-segmentation technique for finding duplicate science projects. The method effectively extracts the features of science project applications and performs well in computing the similarity between applications.
Keywords/Search Tags:Finding Duplicate Science Project, Non-segmentation, FCSNV, Suffix Tree, Frequent Closed Itemset
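
The pipeline described in the abstract (shared substrings, frequent closed itemsets, vector space model) can be illustrated with a small Python sketch. This is not the thesis's implementation, which is on the .NET platform and is not reproduced on this page. Several simplifications are assumptions made here for brevity: a brute-force substring index stands in for the Ukkonen suffix tree, an exhaustive enumeration over document groups stands in for CHARM, each vector component is weighted by the size of its itemset, and cosine similarity is used to compare vectors. All function names and the sample texts are hypothetical.

```python
from itertools import combinations
from math import sqrt


def shared_substrings(docs, min_len=2, min_docs=2):
    """Map each substring to the set of document ids that contain it.

    Brute-force stand-in for the nodes of a generalized suffix tree;
    only substrings found in at least `min_docs` documents are kept.
    """
    occ = {}
    for doc_id, text in enumerate(docs):
        for start in range(len(text)):
            for end in range(start + min_len, len(text) + 1):
                occ.setdefault(text[start:end], set()).add(doc_id)
    return {sub: ids for sub, ids in occ.items() if len(ids) >= min_docs}


def closed_itemsets(item_docs, n_docs, min_support=2):
    """Return {closed itemset: supporting document ids}.

    For every group of documents, the set of items common to all of them
    is a closed itemset. Exhaustive stand-in for CHARM; toy inputs only.
    """
    closed = {}
    for size in range(min_support, n_docs + 1):
        for group in combinations(range(n_docs), size):
            gset = set(group)
            items = frozenset(s for s, ids in item_docs.items() if gset <= ids)
            if items:
                closed[items] = set.intersection(*(item_docs[s] for s in items))
    return closed


def fcsnv_vectors(docs, min_len=2, min_support=2):
    """Build one vector per document, one component per closed itemset.

    Assumed weighting: the number of substrings in the itemset if the
    document supports it, else 0.
    """
    item_docs = shared_substrings(docs, min_len, min_support)
    closed = closed_itemsets(item_docs, len(docs), min_support)
    features = sorted(closed, key=lambda items: sorted(items))
    return [[len(f) if d in closed[f] else 0 for f in features]
            for d in range(len(docs))]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    sample = [
        "深度学习图像识别研究",    # "research on deep-learning image recognition"
        "基于深度学习的图像识别",  # "image recognition based on deep learning"
        "小麦抗病基因研究",        # "research on wheat disease-resistance genes"
    ]
    vectors = fcsnv_vectors(sample)
    for i, j in combinations(range(len(sample)), 2):
        print(i, j, round(cosine(vectors[i], vectors[j]), 3))
```

No word segmentation or dictionary is involved: features come directly from character sequences shared between applications, which is the point of the non-segmentation approach. The brute-force steps grow quickly with text length and document count, which is why a suffix tree and a dedicated closed-itemset miner such as CHARM are preferable at realistic scale.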