
Research Of Improved Near-Duplicate Removal

Posted on: 2012-02-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Nie
Full Text: PDF
GTID: 2178330335960303
Subject: Signal and Information Processing
Abstract/Summary:
With the development of the Internet, the amount of information on websites grows rapidly, which makes it increasingly difficult to locate information precisely. Search engines are the common way of addressing this problem, but removing duplicate information beforehand further improves their results. This thesis proposes a new algorithm for near-duplicate removal that consists of two parts: key-word extraction and an improved LCS (longest common subsequence) similarity algorithm.

Key-word extraction: after segmenting the text, we first compute the term frequency of each word, then adjust each word's score according to whether the word is associated with the title. After sorting the words by score, the top several words are selected as key words. Finally, all other words are removed, and the remaining sequence of key words, called the "frame" (or the "real key words"), is used in the subsequent steps.

For the similarity computation, the thesis applies an improved LCS algorithm. Instead of matching single words as the classic LCS does, the improved algorithm matches pairs of words together with weights, so it preserves more of the semantic information.

After the similarity computation, the system splits the documents into two groups, non-duplicate and duplicate: an index records the non-duplicate documents, and a text file records the duplicates.

The experimental results show that the proposed method outperforms both the classic LCS algorithm and the classic VSM (vector space model) algorithm.
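The key-word extraction step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the `top_k` cutoff, the `title_boost` factor, and whitespace tokenization (in place of proper Chinese word segmentation) are all assumptions made for the sketch.

```python
from collections import Counter

def extract_frame(title, body, top_k=5, title_boost=2.0):
    # Sketch of the key-word extraction step: score words by term
    # frequency, boost words associated with the title, keep the
    # top-k as key words, then reduce the document to the ordered
    # sequence of key words only (the "frame").
    # top_k and title_boost are illustrative values, not from the thesis.
    title_words = set(title.lower().split())
    body_words = body.lower().split()
    tf = Counter(body_words)
    # Adjust each word's score upward when it also appears in the title.
    scores = {w: c * (title_boost if w in title_words else 1.0)
              for w, c in tf.items()}
    keywords = set(sorted(scores, key=scores.get, reverse=True)[:top_k])
    # Remove all other words, preserving the original word order.
    return [w for w in body_words if w in keywords]
```

Keeping the surviving key words in their original order matters, because the next step compares the frames as sequences, not as bags of words.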
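The improved LCS step can then be sketched as a weighted LCS over adjacent word pairs. The exact pairing and weighting scheme of the thesis is not spelled out in the abstract, so this sketch makes two assumptions: pairs are adjacent bigrams of the frame, and each matching pair contributes a caller-supplied weight (defaulting to 1) instead of the flat 1 of the classic LCS.

```python
def weighted_lcs_similarity(seq_a, seq_b, weight=None):
    # Sketch of the improved LCS: compare the two key-word frames as
    # sequences of adjacent word pairs, and let each matching pair
    # contribute its weight to the LCS score rather than a flat 1.
    # The bigram pairing and the `weight` hook are assumptions.
    pairs_a = list(zip(seq_a, seq_a[1:]))
    pairs_b = list(zip(seq_b, seq_b[1:]))
    if not pairs_a or not pairs_b:
        return 0.0
    w = weight or (lambda pair: 1.0)
    m, n = len(pairs_a), len(pairs_b)
    # Standard LCS dynamic program, with weighted match contributions.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pairs_a[i - 1] == pairs_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + w(pairs_a[i - 1])
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Normalize by the heavier of the two sequences, so identical
    # frames score 1.0 and disjoint frames score 0.0.
    total = max(sum(w(p) for p in pairs_a), sum(w(p) for p in pairs_b))
    return dp[m][n] / total
```

Because a bigram only matches when both words match in order, this variant captures local word order, which is one way the pair-based LCS can retain more semantic information than the single-word version.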
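The final grouping step can be sketched as a simple thresholded scan: each incoming document is compared against the documents already accepted as non-duplicates, and routed accordingly. The threshold value and the in-memory index (standing in for the index and text file the thesis writes) are assumptions for illustration.

```python
def split_duplicates(docs, similarity, threshold=0.8):
    # Sketch of the grouping step: `docs` is a list of (doc_id, frame)
    # pairs, `similarity` is a function over two frames (e.g. the
    # improved LCS), and `threshold` is an assumed cutoff.
    # Non-duplicates go to the index; duplicates go to a separate list
    # (the thesis records them in an index and a txt file, respectively).
    index, duplicates = [], []
    for doc_id, frame in docs:
        if any(similarity(frame, kept) >= threshold for _, kept in index):
            duplicates.append(doc_id)
        else:
            index.append((doc_id, frame))
    return index, duplicates
```

Comparing only against the already-accepted index (rather than all pairs) keeps each duplicate attached to the first-seen copy of its group.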
Keywords/Search Tags:information retrieval, duplicate detection, fingerprint, improved LCS algorithm