
Research On Detection Of Duplicated Web Pages With Bool-Model

Posted on: 2007-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: H Lian
Full Text: PDF
GTID: 2178360185954106
Subject: Computer software and theory
Abstract/Summary:
With the development of information technology, more and more information appears on the Internet, and the Internet has become a primary means for people to obtain the information they need. Often, however, users know only a few keywords describing the information they want and do not know the address of the relevant web page. Developing tools to help users find the information they need has therefore become a research field in natural language processing. Inspired by text retrieval technology, search engines were developed, making it convenient for Internet users to obtain the information they need.

The appearance of search engines brought great convenience to finding information on the Internet and was warmly welcomed by users. More search engines followed, such as Google for multiple languages and Baidu for Chinese. However, driven by the commercial interest of obtaining a high rank, many websites copy content from other websites, so a search engine often returns many different links with the same content. This not only increases the burden on the search engine's processors but also reduces the effectiveness of the retrieval results, and it forces users to spend more time obtaining the results they want.

The purpose of detecting duplicated web pages is to increase the retrieval efficiency of search engines and the effectiveness of the retrieval results. Based on an analysis of web page contents, we put forward two methods to detect duplicated web pages: detection with the bool-model and detection with high-frequency words. One method extracts features based on word frequency, transforms a text into a feature string, and then uses the feature string to recognize duplicated web pages. The other uses the bool-model to represent a text and calculates the Hamming distance between two texts; in this way, the number of comparisons is decreased. A minimal sketch of both ideas is given after the abstract.

Our research focuses on the following:
1. Studying why duplicated web pages proliferate, and comparing the various existing detection algorithms.
2. Comparing two definitions of "duplicated" and, on the basis of this comparison, designing a recognition algorithm for duplicated web pages according to Pugh's definition.
3. Testing the algorithm in several experiments; the results show that it is highly efficient.
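The abstract describes both methods only at a high level. The Python sketch below is an illustration of the two ideas, not the thesis's actual algorithm: it assumes a simple punctuation-splitting tokenizer, a small vocabulary of the corpus's most frequent words, a feature string built from each page's top-k words, and a hypothetical Hamming-distance threshold, none of which are specified in the abstract.

# Illustrative sketch only: the tokenizer, vocabulary size, feature-string
# length, and threshold are hypothetical choices, not the thesis's parameters.
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-word characters (a stand-in tokenizer)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

# --- Method 1: feature string from high-frequency words ----------------
def feature_string(doc, k=8):
    """Join the document's k most frequent words (sorted) into one string;
    identical strings suggest duplicated pages."""
    top = Counter(tokenize(doc)).most_common(k)
    return "|".join(sorted(word for word, _ in top))

# --- Method 2: bool-model representation + Hamming distance ------------
def build_vocabulary(docs, size=64):
    """Use the `size` most frequent words across the corpus as features."""
    counts = Counter(t for doc in docs for t in tokenize(doc))
    return [w for w, _ in counts.most_common(size)]

def bool_vector(doc, vocab):
    """Boolean model: 1 if the vocabulary word occurs in the text, else 0."""
    present = set(tokenize(doc))
    return [1 if w in present else 0 for w in vocab]

def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def near_duplicates(docs, threshold=3):
    """Pairs of documents whose Boolean vectors differ in at most
    `threshold` positions (a hypothetical duplicate criterion)."""
    vocab = build_vocabulary(docs)
    vecs = [bool_vector(d, vocab) for d in docs]
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if hamming(vecs[i], vecs[j]) <= threshold]

if __name__ == "__main__":
    pages = [
        "search engines help users find information on the internet",
        "search engines help users to find information on the internet",
        "the boolean model represents a text as a vector of word flags",
    ]
    print([feature_string(p) for p in pages])
    print(near_duplicates(pages))  # expect the first two pages to pair

Comparing fixed-length Boolean vectors by Hamming distance replaces word-by-word comparison of page pairs with a cheap vector operation, which is presumably the sense in which the abstract says the number of comparisons is decreased.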
Keywords/Search Tags: detection of duplicated web pages, bool-model, feature string, Hamming distance