Research On Removing Duplicated WebPages Algorithm Of Search Engine Based On Content

Posted on: 2011-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 2178360332958320
Subject: Information Science

Abstract/Summary:
With the development of the information era, the value of the resources on the net keeps growing for the real world. The web has become a major channel through which users publish and obtain information, so every type of information resource has been growing explosively. Search engines, designed to search and organize the information on the web, have become the most important tool for users to retrieve it. However, among the billions of web pages there is a large number of duplicates, most of them produced by reprinting; some replicas are identical in content while others overlap only in part. These replicated pages place a heavy burden on search engines, degrade their performance, and therefore hurt the user experience. To improve retrieval quality, the detection of duplicated web pages has become a serious problem that search engines must face, and it has been a very active research topic in the information retrieval field in recent years.

Web replica deletion consists of two main parts. The first deals with the original pages: format conversion, noise purification, and extraction of the page theme. The second focuses on replica detection based on page content. Much related work has been done at home and abroad, and web similarity detection methods fall mainly into three categories: URL analysis, syntactic analysis, and semantic analysis.

This thesis is divided into four chapters. Chapter 1 presents the background of the subject and its main tasks. Chapter 2 reviews existing web page purification methods and the concept and use of the DOM, and then proposes a theme extraction method based on tag windows. The principle is as follows: NekoHTML, a popular HTML parsing tool, parses the web page into a tag tree in memory, and the standard DOM API implemented by NekoHTML is used to traverse this tree. After URLs, images, and scripts are removed, tag windows, each consisting of an HTML tag together with the textual content inside it, are extracted during the traversal. Every tag window is then assigned a weight, calculated mainly from the grammatical features of the content inside the tag. The tag window with the largest weight contains the theme of the page.

Chapter 3 analyzes the existing web similarity detection methods in detail and then presents our algorithm, which is based on the big chunks and long sentences of a page. We propose a Bloom filter based similarity test together with a new set of indicators for computing the similarity and containment of two pages. After noise purification and theme extraction, features are taken from the long sentences in the big chunks of each document. All features of a page are hashed, so each page owns a Bloom filter. When a new page is crawled, its Bloom filter is built by the same steps and compared with the Bloom filters of the stored pages by computing their containment and similarity; if either value exceeds a certain threshold, such as 80%, the new page is judged to be a copy.

Chapter 4 presents the implementation and analysis of our duplicated web page deletion algorithm, and analyzes in depth the response time of the Bloom filter when it is used in content-based similarity detection.
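The tag-window extraction of Chapter 2 can be sketched as follows. This is only an illustrative reading of the abstract: the NekoHTML DOMParser usage is standard, but the weight shown (text length plus a bonus per sentence mark) is a hypothetical stand-in for the thesis's grammatical-feature weighting, and the class and method names are invented for the example.

```java
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class TagWindowExtractor {
    private String bestWindow = "";
    private double bestWeight = -1;

    // Parse the page with NekoHTML and return the heaviest tag window.
    public String extractTheme(String html) throws Exception {
        DOMParser parser = new DOMParser();                      // NekoHTML
        parser.parse(new InputSource(new StringReader(html)));
        traverse(parser.getDocument().getDocumentElement());
        return bestWindow;
    }

    private void traverse(Node node) {
        String tag = node.getNodeName().toLowerCase();
        // Drop scripts, images and links (URL noise) before windowing.
        if (tag.equals("script") || tag.equals("style")
                || tag.equals("img") || tag.equals("a")) {
            return;
        }
        // A tag window: the text sitting directly inside this element.
        StringBuilder direct = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                direct.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                traverse(child);                                 // recurse
            }
        }
        String text = direct.toString().trim();
        if (text.isEmpty()) {
            return;
        }
        // Stand-in weight: text length plus a bonus per sentence mark;
        // the thesis instead scores grammatical features of the content.
        double weight = text.length() + 10.0 * text.split("[.!?。！？]").length;
        if (weight > bestWeight) {
            bestWeight = weight;
            bestWindow = text;
        }
    }
}
```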
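The Chapter 3 comparison can likewise be sketched with a plain BitSet-based Bloom filter. The feature source (long sentences in big chunks), the per-page filter, and the 80% threshold come from the abstract; the filter size, the number of hash functions, the hash construction, and the 40-character "long sentence" cutoff below are assumptions made for the example.

```java
import java.util.BitSet;

public class PageFingerprint {
    private static final int BITS = 1 << 16;  // assumed filter size
    private static final int HASHES = 4;      // assumed hash-function count
    private final BitSet bits = new BitSet(BITS);

    // Hash one feature (a long sentence from a big chunk) into the filter.
    public void add(String feature) {
        for (int seed = 0; seed < HASHES; seed++) {
            bits.set(Math.floorMod(
                    feature.hashCode() * 31 + seed * 0x9E3779B9, BITS));
        }
    }

    // Similarity: shared bits over all set bits of the two filters.
    public static double similarity(PageFingerprint a, PageFingerprint b) {
        BitSet and = (BitSet) a.bits.clone();
        and.and(b.bits);
        BitSet or = (BitSet) a.bits.clone();
        or.or(b.bits);
        return or.isEmpty() ? 0.0
                : (double) and.cardinality() / or.cardinality();
    }

    // Containment: how much of page b's filter is covered by page a's.
    public static double containment(PageFingerprint a, PageFingerprint b) {
        BitSet and = (BitSet) a.bits.clone();
        and.and(b.bits);
        return b.bits.isEmpty() ? 0.0
                : (double) and.cardinality() / b.bits.cardinality();
    }
}
```

Used together with the extractor above, a new page would be fingerprinted and tested roughly like this (themeText and stored are assumed variables for the example):

```java
// Build a fingerprint from the extracted theme text and test a stored page.
PageFingerprint fresh = new PageFingerprint();
for (String sentence : themeText.split("[.!?。！？]")) {
    if (sentence.trim().length() >= 40) {   // assumed "long sentence" cutoff
        fresh.add(sentence.trim());
    }
}
boolean duplicate = PageFingerprint.similarity(fresh, stored) > 0.8
        || PageFingerprint.containment(stored, fresh) > 0.8;
```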
Keywords/Search Tags: Tag-Window, content extraction, duplicated web page deletion, feature code, big paragraph, long sentence, Bloom filter