
Study And Applications Of Duplicate Web Page's Elimination And Clustering Algorithm In Search Engine System Of Colleges And Universities

Posted on: 2011-07-17
Degree: Master
Type: Thesis
Country: China
Candidate: C H Dang
GTID: 2178360302980078
Subject: Computer application technology
Abstract/Summary:
At present, the results of most search engines contain a large number of duplicate and reproduced pages. Meanwhile, web page clustering for Chinese search engines is still in its early stages, and many techniques cannot yet be applied in practice. As China's colleges and universities grow in number and size every year, the shortcomings of lagging college search engine technology have also surfaced. To address these problems, this thesis studies and discusses the basic working principles of search engines, current duplicate web page elimination algorithms, and web page clustering techniques, and designs a search engine system for colleges and universities. The research and applications focus on the following areas:

First, the web page representation model used in web page preprocessing is studied and improved. To support the subsequent clustering work, the vector space model is studied and its term weighting is improved, addressing the original model's failure to reflect that terms in different locations carry different weights, as well as semantic issues. To obtain the web page model, block-based page extraction is studied, and the specific process of the page blocking algorithm is given.

Second, duplicate web page elimination algorithms are studied and improved. To handle the huge number of duplicated and reproduced web pages, two algorithms are studied: a distance-based elimination algorithm for near-duplicate web pages and a full-text sub-signature algorithm. The former introduces the computation of web page similarity to improve the quality of duplicate elimination. The latter builds on the original full-text sub-signature method and greatly reduces the data set, which solves the slowness of the original method.

Third, existing clustering algorithms are studied and improved. A comparative study is conducted on existing algorithms, including the K-means clustering algorithm, the EM clustering algorithm, the clustering algorithm based on tolerance rough sets, and the min-max hyperbox clustering algorithm. Building on these, two novel clustering algorithms are proposed: one based on rotated hyperboxes, and another based on tolerance rough sets and rotated hyperboxes. The former applies fuzzy theory to hyperbox clustering to overcome the original algorithms' limitations in recognizing and handling cluster shapes. The latter uses tolerance rough set theory to address the fact that most clustering algorithms focus on making objects within a class as similar as possible and objects in different classes as dissimilar as possible, without considering the presence of cross-semantics. The clustering results of the latter are more understandable.

Fourth, a search engine system for colleges and universities is proposed on the basis of existing search engines, and the application of the duplicate web page elimination and clustering algorithms within it is described. The former covers the system structure, working principles, and workflow of the college and university search engine. The latter covers the application process of the various algorithms, including the page preprocessing steps (text extraction and duplicate web page elimination) and the web page clustering process using the K-means clustering algorithm, the K-medoids algorithm, ..., and the min-max hyperbox clustering algorithm. Finally, the various algorithms are evaluated and compared.

The experiments indicate that the approach addresses the basic problems faced by a search engine system for colleges and universities, and that it is superior to most of the existing algorithms.
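
As a rough illustration of the first two contributions, the sketch below builds a location-weighted term vector for each page and drops pages that are too similar to one already kept. It is only a minimal sketch under stated assumptions, not the thesis's algorithm: the location names, the weight values in `LOCATION_WEIGHTS`, the use of cosine similarity as the similarity measure, and the `threshold` of 0.9 are all hypothetical, since the abstract does not specify them.

```python
import math
import re
from collections import Counter

# Hypothetical location weights. The abstract only says that terms in
# different locations should carry different weights; it gives no values.
LOCATION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def page_vector(blocks):
    """Build a weighted term vector from a page given as {location: text}."""
    vector = Counter()
    for location, text in blocks.items():
        weight = LOCATION_WEIGHTS.get(location, 1.0)
        for term in tokenize(text):
            vector[term] += weight
    return vector


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(pages, threshold=0.9):
    """Keep a page only if it is below the similarity threshold
    against every page kept so far (assumed threshold, not from the thesis)."""
    kept = []
    for page in pages:
        vector = page_vector(page)
        if all(cosine_similarity(vector, other) < threshold for _, other in kept):
            kept.append((page, vector))
    return [page for page, _ in kept]
```

This pairwise comparison is quadratic in the number of pages, which is consistent with the abstract's motivation for a signature-based method that reduces the data set before comparison.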
Keywords/Search Tags: clustering, duplicate web page elimination algorithm, vector space model, search engine