Research On Search Engine Based On Web Page Mining

Posted on:2007-04-25

Degree:Master

Type:Thesis

Country:China

Candidate:Y G Huang

Full Text:PDF

GTID:2178360185485863

Subject:Computer Science and Technology

Abstract/Summary:

With the growth of the Internet and of online information, search engines have become increasingly popular. They organize scattered information and present it in a more efficient and useful form. The basic unit a search engine works with is the web page, so this thesis focuses on mining the time information, structure, and fingerprints of web pages. It describes how to improve the retrieval results of search engines in three respects: incremental crawling, web page purification, and duplicate page deletion.

In the incremental crawling module, because news web sites update very frequently, this thesis mines the time information of web pages and uses it to reduce the number of page crawls and database lookups. This efficiently solves the problem of incrementally crawling frequently updated sites, so that users obtain new pages promptly.

In the page purification module, a web page is represented as a DOM tree, and the number of Chinese punctuation marks is introduced into the weight of the page content. Pruning the DOM tree reduces the noise in the page.

In the duplicate page deletion module, we propose a method based on web page purification. Purifying the page and then extracting its fingerprint features effectively improves the accuracy of duplicate page deletion. Applying this deletion method to the clustering of abnormal short texts also gives good results.

Experiments show that the methods in this thesis greatly improve the performance of search engines and give satisfactory results.
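The purification and duplicate deletion steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several parts are assumptions made for illustration: instead of building and pruning a full DOM tree, it splits the page into text blocks at a few block-level tags; it keeps a block only if it contains at least `min_punct` full-width punctuation marks (a crude stand-in for the punctuation-based content weight); and it uses an MD5 hash of the purified text as the fingerprint. The names `BlockCollector`, `purify`, `fingerprint`, and `is_duplicate` are hypothetical.

```python
import hashlib
import re
from html.parser import HTMLParser

# Assumed set of full-width punctuation typical of Chinese running text.
CJK_PUNCT = set(",。!?;:、")
BLOCK_TAGS = {"p", "div", "td", "li"}   # assumed block boundaries
SKIP_TAGS = {"script", "style"}         # never treated as content


class BlockCollector(HTMLParser):
    """Collect the text between block-level tag boundaries as separate blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0

    def _flush(self):
        text = "".join(self._buf).strip()
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def purify(html, min_punct=1):
    """Drop blocks with too little Chinese punctuation (likely navigation or footers)."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = [b for b in parser.blocks
            if sum(ch in CJK_PUNCT for ch in b) >= min_punct]
    return "\n".join(kept)


def fingerprint(text):
    """Hash the whitespace-free text; identical content gives an identical digest."""
    normalized = re.sub(r"\s+", "", text)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def is_duplicate(html_a, html_b):
    """Compare two pages by the fingerprint of their purified content."""
    return fingerprint(purify(html_a)) == fingerprint(purify(html_b))
```

Under these assumptions, two pages that carry the same article body inside different navigation bars and footers purify to the same text and therefore receive the same fingerprint. An exact hash only catches identical purified content; detecting near-duplicates would need a similarity-preserving fingerprint, which this sketch does not attempt.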

Keywords/Search Tags:

search engine, information gathering, web page analysis, duplicated web page deletion, clustering

Related items

1	Research On NLP-Based Duplicated Web Pages Deletion Algorithm
2	The Study And Implementation On The Key Problems Of Intelligent Search Engine Technology
3	Research Of Search Engine
4	Design And Realization Of A Web Page Gathering System With JavaScript Parsing
5	Design And Realization Of A Web Page Gathering System With Javascript Parsing
6	Search Engine Research And Implementation
7	The Optimization And Implement Of Enterprise Search Engine
8	Intelligent And Personalized Research For Web Search Engine
9	Personalized Search Engine Research And Design
10	Research And Implementation On Removing Duplicated WebPages Algorithm Of Search Engine