Research On The Algorithm For Chinese Duplicated Web Pages Detection

Posted on: 2011-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: H Tu
Full Text: PDF
GTID: 2178360308460888
Subject: Computer application technology
Abstract/Summary:
With the wide spread and rapid growth of the Internet, the amount of information available online is growing explosively. As a result, search engines have become an important tool for finding exactly what people want, and they receive more and more attention. Detecting duplicated web pages has long been a key and actively studied problem for search engines, so this thesis investigates how to improve both the efficiency and the accuracy of duplicate detection algorithms.

By comparing the existing kinds of duplicate detection methods, I concluded that content-based methods perform better and that adding link or anchor-window information does not improve them noticeably, so I focused on content-based detection algorithms.

The Digital Syntactic Clustering (DSC) algorithm is a typical and widely used content-based detection method. It extracts features of web pages based on syntax. Results show that it is not suitable for detecting short web pages. Experiments by Google suggested that it performs better when term frequency is taken into consideration. I applied several strategies, including frequency statistics and natural language understanding. In the improved algorithm, the weight of each keyword is computed by combining TF, IDF, and the position of the word in a certain proportion. In addition, I applied the vector space model (VSM) to compute the similarity of web pages.

The process was as follows. First, I parsed the HTML web pages and processed the resulting plain text with Chinese word segmentation. Each web page could then be represented in the VSM as a vector of statistical keyword weights, from which I computed the pairwise similarity and judged whether two pages were duplicates.

Experiments were then carried out. Comparing the results, I found that the improved duplicated web page detection algorithm performs better. Finally, I proposed some other aspects worth further research and explained what should be done.
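The weighting and similarity steps described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the exact proportion used to combine TF, IDF, and word position, so the sketch assumes a simple title boost for position, a smoothed IDF, and a similarity threshold of 0.8. The names `keyword_weights`, `cosine`, `is_duplicate`, `title_boost`, and `threshold` are hypothetical, and the input is assumed to be already parsed from HTML and segmented into words.

```python
import math
from collections import Counter


def keyword_weights(docs, title_boost=2.0):
    """Build one weighted term vector per document.

    docs: list of (title_tokens, body_tokens) pairs, already segmented.
    title_boost: assumed extra count for a word occurring in the title,
                 standing in for the thesis's unspecified position weight.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for title, body in docs:
        df.update(set(title) | set(body))

    vectors = []
    for title, body in docs:
        counts = Counter(body)
        for term in title:
            counts[term] += title_boost  # position-based boost (assumption)
        total = sum(counts.values())
        vec = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_duplicate(u, v, threshold=0.8):
    """Judge two pages duplicates when similarity reaches the assumed threshold."""
    return cosine(u, v) >= threshold


# Illustrative, pre-segmented pages: (title tokens, body tokens).
page_a = (["网页", "去重"], ["搜索", "引擎", "网页", "去重", "算法"])
page_b = (["网页", "去重"], ["搜索", "引擎", "网页", "去重", "算法"])  # mirror of page_a
page_c = (["天气", "预报"], ["今天", "北京", "晴"])  # unrelated page

vecs = keyword_weights([page_a, page_b, page_c])
print(is_duplicate(vecs[0], vecs[1]))  # identical content
print(is_duplicate(vecs[0], vecs[2]))  # no shared terms
```

In this sketch the mirrored page is flagged as a duplicate and the unrelated page is not. In practice the threshold and the position weighting would have to be tuned on labelled data.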
Keywords/Search Tags: duplicated web page detection, web page similarity, DSC algorithm, TF/IDF, word position