
Web Resource, Page Cleaning And Classifying

Posted on: 2007-11-08
Degree: Master
Type: Thesis
Country: China
Candidate: J Liu
Full Text: PDF
GTID: 2178360182987483
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, the amount of information on the Web keeps growing, which makes it difficult to find useful information. Current search engines only match keywords. They cannot understand what the user means, so they may return a lot of useless information. An intelligent search engine can solve this problem. It has many parts, such as web resource discovery, page cleaning, page classifying, page clustering and information extraction. This paper focuses on web resource discovery, page cleaning and page classifying. The following work has been done. (1) This paper introduces the theory of web resource discovery and compares several algorithms used in this field. (2) In order to obtain more topic-relevant pages more efficiently, this paper puts forward a new algorithm. First, the algorithm creates a domain-oriented ontology. Second, it computes relevance scores for links and pages. Finally, it decides the crawling direction. Experiments show that the algorithm achieves a good relevance rate. (3) This paper introduces the theory of web cleaning and analyses several algorithms for page segmentation and web cleaning. (4) A VIPS-based page cleaning method is put forward. First, it uses VIPS to partition pages into sub-pages and saves the sub-pages that are not images into a database. Second, it counts how many times each sub-page occurs by calculating the degree of similarity between sub-pages. Last, it calculates the weight of every sub-page according to its number of occurrences, the length of the text in the sub-page, the position of the sub-page and the number of URLs in the sub-page. Noise is a sub-page that has a small weight. Tests show this algorithm can...
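The weighting step in (4) can be illustrated with a minimal Python sketch. The abstract does not give the weight formula, the similarity measure or the VIPS output format, so everything here is an assumption made for illustration: the `SubPage` fields, the use of `difflib.SequenceMatcher` with a 0.9 threshold to count occurrences, the position encoding (0.0 for the page centre, 1.0 for the edge) and the formula inside `weight`. VIPS segmentation itself is assumed to have already produced the sub-pages.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class SubPage:
    """One block produced by page segmentation (hypothetical structure)."""
    text: str        # visible text of the block
    position: float  # assumed encoding: 0.0 = page centre, 1.0 = page edge
    url_count: int   # number of links inside the block


def occurrences(block, corpus, threshold=0.9):
    """Count blocks in the corpus whose text is nearly identical to this one."""
    return sum(
        1
        for other in corpus
        if SequenceMatcher(None, block.text, other.text).ratio() >= threshold
    )


def weight(block, corpus):
    """Assumed formula: long, central, link-poor, rarely repeated blocks score high."""
    repeats = max(1, occurrences(block, corpus))  # 1 if the block is not in the corpus
    words = len(block.text.split())
    link_density = block.url_count / (words + 1)
    return (words / repeats) * (1.0 - block.position) / (1.0 + link_density)


def find_noise(corpus, cutoff):
    """Return the blocks whose weight falls below the cutoff."""
    return [block for block in corpus if weight(block, corpus) < cutoff]
```

Under these assumptions, a navigation bar that repeats on every page, sits at the edge and consists mostly of links gets a small weight, while a long block of body text that occurs once gets a large one. The cutoff passed to `find_noise` would have to be tuned on real data.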
Keywords/Search Tags: search engine, web resource discovery, page cleaning, page classifying