Design And Realization Of A Spider System For A Web-Based Image Search Engine

Posted on:2011-05-09

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhou

Full Text:PDF

GTID:2208360308966916

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, digital cameras, and scanning technology, the volume of image information on the Internet is growing explosively. However, because web data is diverse, complex, and irregular, retrieving image information quickly from such massive data has become a challenging task. Image search engines were created to solve this problem.

Web image search engines fall into two types: content-based and text-based. A content-based image search engine indexes images mainly by their content (such as color and texture). A text-based Web image search engine labels images mainly by the hyperlinks between pages and other textual information. However, current solutions are inefficient and have difficulty retrieving images accurately. Our project team therefore proposes a Web image search technique based on spectral graph theory that combines the content-based and text-based approaches. It is a new and more effective method for analyzing Web image information.

Building the image search engine starts with collecting image data with spiders. However, because information on the network is complex, a spider may download a great deal of useless data, which wastes network bandwidth and hampers information extraction. We therefore extend Heritrix and improve its features, and design a common spider and a precise spider for different kinds of sites. For ordinary sites, we give priority to comprehensive coverage and download with the common spider module. For image sites, we download with the precise spider module at the cost of comprehensiveness. This guarantees the quantity and quality of the image data to a certain degree. Once the data has been downloaded, eliminating noise and extracting useful descriptive information for each image becomes the next problem to solve.
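The division of labor between the common spider and the precise spider can be pictured with a minimal Python sketch. It is an illustration only and not the thesis's Heritrix extension: the host list, the URL patterns, and the function names `choose_spider` and `should_fetch` are all hypothetical, and the abstract does not say how image sites are actually recognized.

```python
import re
from urllib.parse import urlparse

# Hypothetical set of hosts treated as "image sites" (assumption).
IMAGE_SITE_HOSTS = {"photos.example.com"}

# Hypothetical URL patterns the precise spider may follow (assumption).
PRECISE_PATTERNS = [
    re.compile(r"/(gallery|album|photo)/"),
    re.compile(r"\.(jpe?g|png|gif)$", re.IGNORECASE),
]


def choose_spider(url: str) -> str:
    """Return 'precise' for known image sites and 'common' otherwise."""
    host = urlparse(url).hostname or ""
    return "precise" if host in IMAGE_SITE_HOSTS else "common"


def should_fetch(url: str) -> bool:
    """Decide whether a discovered URL is worth downloading.

    Common spider: follow every HTTP(S) link (favors coverage).
    Precise spider: follow only URLs whose path matches an allowed
    pattern (favors relevance, gives up some coverage).
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if choose_spider(url) == "common":
        return True
    return any(p.search(parsed.path) for p in PRECISE_PATTERNS)
```

Under these assumptions, an "about" page on an ordinary site is still fetched, while the same kind of page on an image site is skipped and only gallery pages and image files are kept.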
In this thesis, we analyze the HTML tags of web pages and implement an effective page analysis that extracts the text describing each image, which improves the accuracy and precision of the retrieval system. To keep the system up to date, we also extend Heritrix's update scheme so that it decides whether a page needs to be updated by analyzing three aspects together: page structure, content, and images.

The thesis first describes the overall design of the image search engine, and then describes the data download module, the preprocessing module, the image classification module, and the image retrieval module in turn. The common spider and the precise spider, used to download data from different kinds of pages, are implemented after analyzing the overall structure, running process, and important components of the spider. To meet the system's needs, page parsing, Chinese word segmentation, and image standardization are implemented in the data processing stage. We also analyze the spider's update strategy and propose a more effective one, which improves the system's update rate. In addition, we sort the spider's queue using hash operations to optimize spider performance. Finally, the performance of the spider and of the whole system is tested and analyzed.
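The update decision based on page structure, content, and images might look roughly like the following Python sketch. The abstract does not say how the three aspects are compared, so hashing each aspect into a fingerprint is an assumption made here for illustration, and the names `PageSummary`, `fingerprint`, `changed_aspects`, and `needs_update` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the tag sequence, visible text, and images of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []      # structure: start tags in document order
        self.text = []      # content: visible text fragments
        self.images = []    # images: (src, alt) pairs
        self._skip = 0      # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag in ("script", "style"):
            self._skip += 1
        if tag == "img":
            attr = dict(attrs)
            self.images.append((attr.get("src") or "", attr.get("alt") or ""))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only fragments.
        if not self._skip and data.strip():
            self.text.append(" ".join(data.split()))


def _digest(parts):
    """Hash a sequence of strings into one hex fingerprint."""
    h = hashlib.sha1()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab"] differs from ["a", "b"]
    return h.hexdigest()


def fingerprint(html: str) -> dict:
    """Return one fingerprint per aspect: structure, content, image."""
    summary = PageSummary()
    summary.feed(html)
    summary.close()
    return {
        "structure": _digest(summary.tags),
        "content": _digest(summary.text),
        "image": _digest(f"{src}|{alt}" for src, alt in summary.images),
    }


def changed_aspects(old_html: str, new_html: str) -> list:
    """List the aspects whose fingerprints differ between two versions."""
    old, new = fingerprint(old_html), fingerprint(new_html)
    return [aspect for aspect in old if old[aspect] != new[aspect]]


def needs_update(old_html: str, new_html: str) -> bool:
    """Assumed policy: re-process the page if any aspect changed."""
    return bool(changed_aspects(old_html, new_html))
```

Keeping the three fingerprints separate lets a crawler apply a different policy per aspect, for example re-downloading images only when the image fingerprint changes. Whether the thesis weights the aspects this way is not stated.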

Keywords/Search Tags:

image search, spider, HTML parser, incremental crawl

Related items

1	Research On Subject-Based Incremental Parallel Crawling
2	The Theme Of The Search Engine Web Spider Search Strategy Study
3	Application Research On Image Search Based On Lucene
4	The Research Of Web Information Search Technology Based On Meta-search
5	Design And Implementation Of Search Engine Based On Lucene And HTML Parser
6	The Technology Of Web Information Extraction Based On HTML Parser
7	Search Engine System Inside Web Site Based On Lucene And Heritrix
8	Research And Achievement Of The Search Strategic For The Topic Search Engine Spider
9	Design And Implementation Of A Spider For Topic-Specific Search Engine
10	Crawl Technology Research For Real-time Vertical Search Engine