
Research Of Chinese News Web Page Duplicate Detection

Posted on: 2015-05-28    Degree: Master    Type: Thesis
Country: China    Candidate: Y Z Wei    Full Text: PDF
GTID: 2308330482978876    Subject: Computer technology
Abstract/Summary:
Nowadays, the Internet has become the primary way for people to obtain and spread information, and web pages are its most important information carriers. Unfortunately, the huge number of duplicate web pages on the Internet causes considerable trouble when we browse it, so duplicate web documents need to be detected. News pages are among the most frequently browsed types of web page; if the duplicate news page detection problem can be solved, the efficiency of information acquisition can be improved to a great extent. This paper seeks a method that improves both the accuracy and the efficiency of duplicate Chinese news web page detection.

Duplicate document detection is an important problem in the information retrieval domain, and because the structure and content of web pages are quite complex, duplicate detection for web documents is even more important. There is much classic work on duplicate web page detection, but most of it was designed for English text. Since the syntax and semantics of Chinese differ greatly from English, the existing detection methods do not handle Chinese web pages well and cannot achieve good accuracy on them. More seriously, with the rapid growth of web pages on the Internet, duplicate web page detection also has to cope with very large volumes of data.

In our study, we found that the Chinese period is very useful for detecting duplicate Chinese news web pages. On the one hand, Chinese periods usually appear in the main content of a Chinese web page and rarely in other content such as advertisements, links, and copyright notices. On the other hand, the Chinese period feature can be used to compute the similarity of web pages.

Based on these observations, this paper proposes a duplicate Chinese web page detection algorithm called CCDet. First, CCDet introduces new similarity measurements, CCS and CLR, which capture the duplicate relation and the containment relation at the same time. Second, CCDet uses the Chinese period feature to compute the similarity of web pages and applies a noise-feature filtering method called index pruning. Finally, to handle large datasets, this paper implements CCDet in parallel on the MapReduce framework. Experimental results show that CCDet achieves better accuracy and efficiency than traditional algorithms, and that the parallel CCDet achieves good scalability.

To show that CCDet is also effective when applied to a distributed search engine, this paper designs a distributed search engine with duplicate web page detection, called Bingo. Built on the open source frameworks Hadoop and Nutch, Bingo is deployed in a distributed environment so that it can process the huge number of web pages crawled from the Internet every day. At the same time, Bingo removes duplicate web pages from users' search results and gives them a more reasonable index structure. Bingo's search results show that CCDet works in practice.
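As an illustration of the period-based similarity idea described above, the following Python sketch splits a page's text on the Chinese period and compares two pages by the overlap of their sentence hashes. The abstract does not give the exact CCS and CLR formulas, so the duplicate and containment scores below are only stand-ins under that assumption.

import hashlib

def sentence_features(text):
    # Split on the Chinese period '。' and hash each sentence.
    # Assumption: periods occur mainly in the news body, so ads, links,
    # and copyright lines contribute few or no features.
    sentences = [s.strip() for s in text.split('。') if s.strip()]
    return {hashlib.md5(s.encode('utf-8')).hexdigest() for s in sentences}

def period_similarity(features_a, features_b):
    # Stand-ins for the CCS and CLR measures named in the abstract
    # (their exact formulas are not given there): a Jaccard-style overlap
    # for near-duplication and an overlap/min ratio for containment.
    if not features_a or not features_b:
        return 0.0, 0.0
    common = len(features_a & features_b)
    duplicate_score = common / len(features_a | features_b)
    containment_score = common / min(len(features_a), len(features_b))
    return duplicate_score, containment_score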
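The abstract also mentions index pruning and a MapReduce implementation but gives no details. The sketch below only illustrates the general idea on a single machine, using a hypothetical document-frequency threshold to drop noisy features and a plain dictionary pass in place of Hadoop jobs.

from collections import defaultdict
from itertools import combinations

def build_pruned_index(pages, max_df=1000):
    # pages: {page_id: set of sentence hashes}.
    # Map each sentence hash to the pages containing it, then drop hashes
    # that occur in more than max_df pages; the threshold is an assumption,
    # standing in for the "index pruning" step named in the abstract.
    index = defaultdict(set)
    for page_id, features in pages.items():
        for h in features:
            index[h].add(page_id)
    return {h: ids for h, ids in index.items() if len(ids) <= max_df}

def candidate_pairs(index):
    # Emit page pairs that share at least one surviving feature;
    # in the thesis this step is parallelized with Hadoop MapReduce.
    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs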
Keywords/Search Tags: CCDet Algorithm, Duplicate Web Page Detection, Chinese Period Feature, Index Pruning, Bingo