
Research On Detection Of Duplicated Web Pages With Bool-Model

Posted on: 2007-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: H Lian
Full Text: PDF
GTID: 2178360185954106
Subject: Computer software and theory
Abstract/Summary:
With the development of information technology, more and more information appears on the Internet, and the Internet has become a primary means for people to obtain the information they need. Often, however, users know only a few keywords describing the information they want and do not know the address of the relevant web page. Developing tools to help users find the information they need has therefore become a research field in natural language processing. Inspired by text retrieval technology, search engines were developed, making it convenient for Internet users to obtain the information they need.

The appearance of search engines brought great convenience to finding information on the Internet and was warmly welcomed by users. More search engines followed, such as Google for multiple languages and Baidu for Chinese. However, driven by the commercial interest of obtaining a high rank, many websites copy content from other websites, so a search engine often returns many different links with the same content. This not only increases the burden on the search engine's processors but also reduces the effectiveness of the retrieval results, and it forces users to spend more time obtaining the results they want.

The purpose of detecting duplicated web pages is to increase the retrieval efficiency of search engines and the effectiveness of the retrieval results. Based on an analysis of web page contents, we put forward two methods to detect duplicated web pages: detection with the bool-model and detection with high-frequency words. One method extracts features based on word frequency, transforms a text into a feature string, and then uses the feature string to recognize duplicated web pages. The other uses the bool-model to represent a text and calculates the Hamming distance between two texts; in this way, the number of comparisons is decreased. A minimal sketch of both ideas is given after the abstract.

Our research focuses on the following:
1. Studying why duplicated web pages proliferate, and comparing the various existing detection algorithms.
2. Comparing two definitions of "duplicated" and, on the basis of this comparison, designing a recognition algorithm for duplicated web pages according to Pugh's definition.
3. Testing the algorithm in several experiments; the results show that it is highly efficient.
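The abstract describes both methods only at a high level. The Python sketch below is an illustration of the two ideas, not the thesis's actual algorithm: it assumes a simple punctuation-splitting tokenizer, a small vocabulary of the corpus's most frequent words, a feature string built from each page's top-k words, and a hypothetical Hamming-distance threshold, none of which are specified in the abstract.

# Illustrative sketch only: the tokenizer, vocabulary size, feature-string
# length, and threshold are hypothetical choices, not the thesis's parameters.
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-word characters (a stand-in tokenizer)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

# --- Method 1: feature string from high-frequency words ----------------
def feature_string(doc, k=8):
    """Join the document's k most frequent words (sorted) into one string;
    identical strings suggest duplicated pages."""
    top = Counter(tokenize(doc)).most_common(k)
    return "|".join(sorted(word for word, _ in top))

# --- Method 2: bool-model representation + Hamming distance ------------
def build_vocabulary(docs, size=64):
    """Use the `size` most frequent words across the corpus as features."""
    counts = Counter(t for doc in docs for t in tokenize(doc))
    return [w for w, _ in counts.most_common(size)]

def bool_vector(doc, vocab):
    """Boolean model: 1 if the vocabulary word occurs in the text, else 0."""
    present = set(tokenize(doc))
    return [1 if w in present else 0 for w in vocab]

def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def near_duplicates(docs, threshold=3):
    """Pairs of documents whose Boolean vectors differ in at most
    `threshold` positions (a hypothetical duplicate criterion)."""
    vocab = build_vocabulary(docs)
    vecs = [bool_vector(d, vocab) for d in docs]
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if hamming(vecs[i], vecs[j]) <= threshold]

if __name__ == "__main__":
    pages = [
        "search engines help users find information on the internet",
        "search engines help users to find information on the internet",
        "the boolean model represents a text as a vector of word flags",
    ]
    print([feature_string(p) for p in pages])
    print(near_duplicates(pages))  # expect the first two pages to pair

Comparing fixed-length Boolean vectors by Hamming distance replaces word-by-word comparison of page pairs with a cheap vector operation, which is presumably the sense in which the abstract says the number of comparisons is decreased.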
Keywords/Search Tags: detection of duplicated web pages, bool-model, feature string, Hamming distance