Research On Search Engine Based On Web Page Mining

Posted on: 2007-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y G Huang
GTID: 2178360185485863
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet and of information technology, search engines have become more and more popular. They organize scattered information and present it in a more efficient and useful form. The basic element a search engine works with is the web page, so this thesis focuses on three kinds of web page mining: time information mining, structure mining and fingerprint mining. Accordingly, it discusses how to improve the retrieval results of search engines in three areas: incremental crawling, web page purification and duplicate web page deletion.

In the incremental crawling module, because news web sites are updated very frequently, this thesis mines the time information of web pages. It uses that time information to reduce the number of page fetches and database lookups. This solves the problem of incrementally crawling web sites with a high update frequency, so that users can obtain new web pages promptly. In the page purification module, we represent a web page as a DOM tree and incorporate the number of Chinese punctuation marks into the weight of the page content. Pruning the DOM tree then reduces the noise in the web page. In the duplicate page deletion module, we propose a method based on web page purification. Purifying the page first and then extracting its fingerprint features improves the accuracy of duplicate page deletion. Furthermore, applying this deletion method to the clustering of abnormal short texts also gives good results.

Experiments show that the methods in this thesis improve the performance of search engines and provide satisfactory results.
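
The purification and duplicate deletion steps described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not give the weighting formula or the fingerprint features, so everything below is an assumption made for illustration. The sketch treats each block-level element as a flat unit instead of pruning a full DOM tree, keeps a block only if it contains at least `min_punct` Chinese punctuation marks, and uses an MD5 digest of the purified text as the fingerprint. The names `BlockCollector`, `purify`, `fingerprint`, `CJK_PUNCT` and `min_punct` are hypothetical.

```python
import hashlib
import re
from html.parser import HTMLParser

# Assumed punctuation set and tag lists; the thesis does not specify them.
CJK_PUNCT = "，。！？；：、"
BLOCK_TAGS = {"div", "p", "td", "li"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects the text of block-level elements as a flat list of strings."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0  # depth inside <script>/<style>

    def _flush(self):
        text = "".join(self._buf).strip()
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()  # a new block starts: close the previous one

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text


def purify(html, min_punct=1):
    """Keep only blocks with at least min_punct Chinese punctuation marks."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = [b for b in collector.blocks
            if sum(b.count(ch) for ch in CJK_PUNCT) >= min_punct]
    return "\n".join(kept)


def fingerprint(text):
    """MD5 of the text with all whitespace removed (an assumed fingerprint)."""
    compact = re.sub(r"\s+", "", text)
    return hashlib.md5(compact.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    page_a = "<div>首页 | 新闻 | 体育</div><p>今天，搜索引擎发布了新版本。用户反应良好！</p>"
    page_b = "<div>登录 注册</div><p>今天，搜索引擎发布了新版本。用户反应良好！</p><div>版权所有</div>"
    # The two pages differ only in navigation and footer blocks.
    print(fingerprint(purify(page_a)) == fingerprint(purify(page_b)))
```

In this sketch the navigation and footer blocks contain no Chinese punctuation, so they are dropped before hashing and the two pages are treated as duplicates. An exact hash only catches pages whose purified text is identical; detecting near-duplicates would need a different fingerprint scheme.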
Keywords/Search Tags: search engine, information gathering, web page analysis, duplicated web page deletion, clustering