Research On Key Technology Of Massive Image Retrieval Based On Hadoop

Posted on: 2014-07-22 | Degree: Master | Type: Thesis
Country: China | Candidate: X L Zhang
GTID: 2268330401473704 | Subject: Computer application technology
Abstract/Summary:
To address massive image retrieval, this thesis applies distributed computing to two problems of massive image retrieval, with the ultimate aim of reducing running time. The thesis completed the following tasks.

(1) A Hadoop cluster environment was built on a high-performance computing platform, and the required software was installed. The computing performance of the cluster was tested with the KNN algorithm. The results demonstrate the Hadoop cluster's ability to process large data files.

(2) Inverted index technology was used to build the feature library for images. Four algorithms were selected to extract image features: color histogram, color layout, Tamura, and edge histogram. These four features serve as the visual words of an image. Before moving to distributed computing, experiments were run in a stand-alone setting. First, an index was created for 100,000 images using the Lucene framework. Then a large number of query images were used to search the index, and recall and precision were calculated. The average recall and average precision of multi-feature queries were better than those of single-feature queries, and the average time of the five kinds of queries was a few seconds. The results show that inverted index technology is suitable for massive image retrieval and that the retrieval process is both convenient and fast.

(3) A distributed indexing system was designed. Its input is a sequence file, so the massive image collection must first be converted into a sequence file. According to the file input format, the master node splits the input file and creates a map task for each split, then assigns the map tasks to compute nodes. Each compute node reads images from its local hard disk and processes them, which achieves parallel processing. Indexing experiments were carried out on a system of 16 nodes, and the running times were compared with stand-alone processing. The results show that distributed computing offers no advantage when the number of images is below 5,000, but as the number of images increases drastically, distributed computing performs far better.

(4) A distributed searching system was designed. The system segments the index library. Each compute node searches its part of the index library, and the partial results are copied to a free compute node, which sorts them and outputs a new, merged result. First, images were retrieved from index libraries of 10 million and 100 million images on clusters of different sizes. The results demonstrate the Hadoop platform's ability to retrieve images from the index files of massive image collections. Then images were retrieved from the 100 million images while the number of compute nodes was increased gradually, in order to determine the cluster size. Finally, query experiments were run in both the stand-alone and the distributed environment. The results show that distributed computing shows no advantage when the number of images is below 1 million, but as the number grows, distributed retrieval is much better than stand-alone retrieval.
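The search flow in task (4), where each node searches its own index segment and a free node sorts the copied partial results, can be illustrated with a small single-process sketch. This is not the thesis's Lucene/Hadoop implementation: the function names, the string "visual words", and the ranking by number of shared words are all assumptions made for illustration, and the partitions are plain Python dictionaries rather than index files on separate compute nodes.

```python
import heapq
from collections import defaultdict


def build_inverted_index(image_words):
    """Map each visual word to the set of image ids that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def search_partition(index, query_words, k):
    """Search one index segment (the work of a single compute node).

    Assumed scoring: number of visual words shared with the query.
    Returns up to k (-score, image_id) tuples, best first.
    """
    scores = defaultdict(int)
    for word in query_words:
        for image_id in index.get(word, ()):
            scores[image_id] += 1
    # Negated scores let nsmallest return the highest scores first;
    # ties are broken by image id so the order is deterministic.
    return heapq.nsmallest(k, ((-s, i) for i, s in scores.items()))


def distributed_search(partitions, query_words, k):
    """Search every segment, then merge and sort the partial results
    (the role of the free node in the description above)."""
    partial = [search_partition(p, query_words, k) for p in partitions]
    merged = heapq.nsmallest(k, (hit for hits in partial for hit in hits))
    return [(image_id, -neg_score) for neg_score, image_id in merged]


# Hypothetical data: two index segments holding two images each.
partitions = [
    build_inverted_index({"img1": {"c1", "e2", "t3"}, "img2": {"c1", "e9"}}),
    build_inverted_index({"img3": {"c1", "e2", "t3", "l4"}, "img4": {"c7"}}),
]
print(distributed_search(partitions, {"c1", "e2", "t3"}, k=2))
```

Because each segment returns at most k candidates, the merging node handles at most k entries per segment regardless of how many images the segment holds, which is one plausible reason the merge step stays cheap as the index library grows.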
Keywords/Search Tags: distributed computing, image retrieval, inverted index, Hadoop