
Research And Application On Automatically Detect Duplication Technology In Internet

Posted on: 2007-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: G H Bai
Full Text: PDF
GTID: 2178360185454116
Subject: Computer application technology
Abstract/Summary:
With the growing use of the Internet, the volume of information is exploding, which hinders real-time information retrieval. Research on removing duplicate documents is therefore of considerable significance. This thesis covers both the theory and the application of duplicate web page detection, and its contributions are as follows. First, it summarizes the traditional algorithms and analyzes their efficiency, correctness, and recall. Second, it improves an existing algorithm to obtain a new feature code string algorithm based on the frequency of single words. Experiments indicate that the improved algorithm outperforms the traditional ones in both processing speed and recall. Third, it proposes a new inverted list algorithm based on word frequency. Tests of the algorithm's performance, including tests on actual Internet data, show that correctness and recall are improved considerably. The algorithm was applied in the CIS system and received very good feedback.
Keywords/Search Tags: Chinese information processing, string of feature code, online duplicate documents detection, Support Vector Machine (SVM), Vector Space Model (VSM), inverted list, similarity search
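
The abstract names two ideas, a feature code string built from single-word frequencies and an inversion (inverted) list built from word frequencies, but does not give their details. The following is only a minimal sketch of how such a scheme might look, not the thesis's actual algorithm. It assumes that the feature code is the k most frequent characters of a document, that the inverted list maps each feature character to the documents containing it, and that two documents count as duplicates when the Jaccard overlap of their feature codes reaches a threshold. The names `feature_code`, `build_inverted_list`, and `find_duplicates`, and the parameters `k` and `threshold`, are hypothetical.

```python
from collections import Counter, defaultdict


def feature_code(text, k=5):
    """Return the k most frequent non-space characters as a string.

    Assumption: "single word" is approximated by a single character,
    and ties in frequency are broken by character order.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "".join(ch for ch, _ in ranked[:k])


def build_inverted_list(docs, k=5):
    """Map each feature character to the set of doc ids whose code contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for ch in feature_code(text, k):
            index[ch].add(doc_id)
    return index


def find_duplicates(docs, k=5, threshold=0.8):
    """Return sorted pairs of doc ids whose feature codes overlap enough.

    Only documents that share at least one feature character (found via the
    inverted list) are compared, so unrelated documents are never paired.
    """
    codes = {doc_id: set(feature_code(text, k)) for doc_id, text in docs.items()}
    index = build_inverted_list(docs, k)
    pairs = set()
    for doc_ids in index.values():
        ordered = sorted(doc_ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                union = codes[a] | codes[b]
                # Jaccard overlap of the two feature-character sets.
                if len(codes[a] & codes[b]) / len(union) >= threshold:
                    pairs.add((a, b))
    return sorted(pairs)
```

As a usage example under the same assumptions, `find_duplicates({"a": "aaaa bbb cc d", "b": "aaaa bbb cc d!", "c": "xxxx yyy zz w"}, k=3)` pairs documents `a` and `b`, whose three most frequent characters coincide, and leaves `c` unpaired. The inverted list keeps the comparison from being all-pairs: a document is only checked against others that share a feature character with it.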