
Design and Implementation of a Spider System for a Web-Based Image Search Engine

Posted on: 2011-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhou
Full Text: PDF
GTID: 2208360308966916
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, digital cameras, and scanning technology, the volume of image information on the Internet is growing explosively. However, because web data is diverse, complex, and irregular, quickly retrieving image information from massive amounts of data has become a challenging task. Image search engines were created to solve this problem.

Web image search engines fall into two categories: content-based and text-based. A content-based image search engine builds its index mainly from image content (such as color and texture). A text-based Web image search engine labels images mainly using the hyperlinks between pages and other textual information. However, current solutions are inefficient, and it is difficult to retrieve images accurately. Our project team therefore proposes a Web image search technique based on spectral graph theory that combines content-based and text-based image search. It is a new and more effective method for analyzing Web image information.

Building the image search engine first requires collecting image data with spiders. However, because information on the network is complex, a spider may download a large amount of useless data, which wastes network bandwidth and hampers information extraction. We therefore extend Heritrix, improve its features, and design a common spider and a precise spider for different kinds of sites. For ordinary sites, we give priority to comprehensive coverage and download with the common spider module. For image sites, we download with the precise spider module at the cost of comprehensiveness. This helps ensure, to a certain degree, both the quantity and the quality of the image data. After the data has been downloaded, the next problem to solve is how to eliminate noise and extract useful descriptive information for each image.
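The split between the common spider and the precise spider can be illustrated with a minimal Python sketch. It is not the thesis's Heritrix extension: the host name, the URL patterns, and the function names `choose_spider` and `should_follow` are all hypothetical, and the assumption that the precise spider follows only links matching per-site patterns is one plausible reading of "precise".

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site rules: hosts listed here are treated as image sites,
# and only links whose path matches one of the patterns are followed.
PRECISE_RULES = {
    "images.example.com": [
        re.compile(r"^/gallery/\d+$"),
        re.compile(r"^/photo/\d+$"),
    ],
}


def choose_spider(url):
    """Return 'precise' for known image sites and 'common' otherwise."""
    host = urlparse(url).hostname or ""
    return "precise" if host in PRECISE_RULES else "common"


def should_follow(seed_url, link_url):
    """Decide whether a link found while crawling seed_url's site is queued."""
    seed_host = urlparse(seed_url).hostname or ""
    link = urlparse(link_url)
    # Both spiders stay on the seed's host in this sketch.
    if link.hostname != seed_host:
        return False
    # Common spider: favor coverage, follow every same-host link.
    if choose_spider(seed_url) == "common":
        return True
    # Precise spider: trade coverage for relevance, follow only whitelisted paths.
    return any(pattern.match(link.path) for pattern in PRECISE_RULES[seed_host])
```

With these rules, the common spider queues every same-host link on an ordinary site, while the precise spider skips pages on an image site that do not match a gallery or photo pattern.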
In this thesis, we analyze the HTML tags of web pages and implement an effective page analysis that extracts the text describing each image. This improves the accuracy and precision of the retrieval system. At the same time, to keep the system up to date, we extend the update mechanism of Heritrix so that it decides whether a page needs to be updated by analyzing its structure, its content, and its images together.

The thesis first describes the overall design of the image search engine and then describes the data download module, the preprocessing module, the image classification module, and the image retrieval module in turn. The common spider and the precise spider, which download data from different kinds of pages, are implemented on the basis of an analysis of the spider's overall structure, running process, and important components. To meet the system's needs, page parsing, Chinese word segmentation, and image standardization are implemented in the data processing stage. The thesis also analyzes the spider's update strategy and proposes a more effective one, which improves the update rate of the system. In addition, the spider queues are sorted by hash operations to optimize spider performance. Finally, the performance of the spider and of the whole system is tested and analyzed.
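The update decision described above, which compares a page's structure, content, and images, can be sketched in Python as follows. This is an illustrative assumption rather than the thesis's method: the abstract does not say how the three aspects are compared, so the sketch hashes each one separately using the standard-library `html.parser` and `hashlib` modules. The names `PageFingerprinter`, `fingerprint`, and `changed_aspects` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class PageFingerprinter(HTMLParser):
    """Collects the tag sequence, visible text, and image URLs of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def _digest(parts):
    """Hash a list of strings into one hex digest."""
    return hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()


def fingerprint(html):
    """Return one digest per aspect: structure, content, and images."""
    parser = PageFingerprinter()
    parser.feed(html)
    parser.close()
    return {
        "structure": _digest(parser.tags),
        "content": _digest(parser.text),
        "images": _digest(sorted(parser.images)),
    }


def changed_aspects(old_html, new_html):
    """List the aspects whose digests differ between two versions of a page."""
    old, new = fingerprint(old_html), fingerprint(new_html)
    return [aspect for aspect in old if old[aspect] != new[aspect]]
```

A crawler could re-download and re-index a page only when `changed_aspects` returns a non-empty list, and could weight the aspects differently, for example treating a change in `images` as more important for an image search engine than a change in surrounding text.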
Keywords/Search Tags: image search, spider, HTML parser, incremental crawl