
Research Of Improved Near-Duplicate Removal

Posted on: 2012-02-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Nie
Full Text: PDF
GTID: 2178330335960303
Subject: Signal and Information Processing
Abstract/Summary:
With the development of the Internet, the amount of information on websites grows rapidly, which makes it increasingly difficult to locate information precisely. Search engines are the common way of addressing this problem, but removing duplicate information beforehand further improves their results. This thesis proposes a new algorithm for near-duplicate removal that consists of two parts: key-word extraction and an improved LCS (longest common subsequence) similarity algorithm.

Key-word extraction: after segmenting the text, we first compute the term frequency of each word, then adjust each word's score according to whether the word is associated with the title. After sorting the words by score, the top several words are selected as key words. Finally, all other words are removed, and the remaining sequence of key words, called the "frame" (or the "real key words"), is used in the subsequent steps.

For the similarity computation, the thesis applies an improved LCS algorithm. Instead of matching single words as the classic LCS does, the improved algorithm matches pairs of words together with weights, so it preserves more of the semantic information.

After the similarity computation, the system splits the documents into two groups, non-duplicate and duplicate: an index records the non-duplicate documents, and a text file records the duplicates.

The experimental results show that the proposed method outperforms both the classic LCS algorithm and the classic VSM (vector space model) algorithm.
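The key-word extraction step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the `top_k` cutoff, the `title_boost` factor, and whitespace tokenization (in place of proper Chinese word segmentation) are all assumptions made for the sketch.

```python
from collections import Counter

def extract_frame(title, body, top_k=5, title_boost=2.0):
    # Sketch of the key-word extraction step: score words by term
    # frequency, boost words associated with the title, keep the
    # top-k as key words, then reduce the document to the ordered
    # sequence of key words only (the "frame").
    # top_k and title_boost are illustrative values, not from the thesis.
    title_words = set(title.lower().split())
    body_words = body.lower().split()
    tf = Counter(body_words)
    # Adjust each word's score upward when it also appears in the title.
    scores = {w: c * (title_boost if w in title_words else 1.0)
              for w, c in tf.items()}
    keywords = set(sorted(scores, key=scores.get, reverse=True)[:top_k])
    # Remove all other words, preserving the original word order.
    return [w for w in body_words if w in keywords]
```

Keeping the surviving key words in their original order matters, because the next step compares the frames as sequences, not as bags of words.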
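The improved LCS step can then be sketched as a weighted LCS over adjacent word pairs. The exact pairing and weighting scheme of the thesis is not spelled out in the abstract, so this sketch makes two assumptions: pairs are adjacent bigrams of the frame, and each matching pair contributes a caller-supplied weight (defaulting to 1) instead of the flat 1 of the classic LCS.

```python
def weighted_lcs_similarity(seq_a, seq_b, weight=None):
    # Sketch of the improved LCS: compare the two key-word frames as
    # sequences of adjacent word pairs, and let each matching pair
    # contribute its weight to the LCS score rather than a flat 1.
    # The bigram pairing and the `weight` hook are assumptions.
    pairs_a = list(zip(seq_a, seq_a[1:]))
    pairs_b = list(zip(seq_b, seq_b[1:]))
    if not pairs_a or not pairs_b:
        return 0.0
    w = weight or (lambda pair: 1.0)
    m, n = len(pairs_a), len(pairs_b)
    # Standard LCS dynamic program, with weighted match contributions.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pairs_a[i - 1] == pairs_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + w(pairs_a[i - 1])
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Normalize by the heavier of the two sequences, so identical
    # frames score 1.0 and disjoint frames score 0.0.
    total = max(sum(w(p) for p in pairs_a), sum(w(p) for p in pairs_b))
    return dp[m][n] / total
```

Because a bigram only matches when both words match in order, this variant captures local word order, which is one way the pair-based LCS can retain more semantic information than the single-word version.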
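The final grouping step can be sketched as a simple thresholded scan: each incoming document is compared against the documents already accepted as non-duplicates, and routed accordingly. The threshold value and the in-memory index (standing in for the index and text file the thesis writes) are assumptions for illustration.

```python
def split_duplicates(docs, similarity, threshold=0.8):
    # Sketch of the grouping step: `docs` is a list of (doc_id, frame)
    # pairs, `similarity` is a function over two frames (e.g. the
    # improved LCS), and `threshold` is an assumed cutoff.
    # Non-duplicates go to the index; duplicates go to a separate list
    # (the thesis records them in an index and a txt file, respectively).
    index, duplicates = [], []
    for doc_id, frame in docs:
        if any(similarity(frame, kept) >= threshold for _, kept in index):
            duplicates.append(doc_id)
        else:
            index.append((doc_id, frame))
    return index, duplicates
```

Comparing only against the already-accepted index (rather than all pairs) keeps each duplicate attached to the first-seen copy of its group.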
Keywords/Search Tags:information retrieval, duplicate detection, fingerprint, improved LCS algorithm