
Parallel Design For Image Compression Algorithms On GPU

Posted on: 2015-09-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Dai
Full Text: PDF
GTID: 1228330467456549
Subject: Agricultural Electrification and Automation
Abstract/Summary:
Image compression coding can effectively reduce the redundant information of image content, which saves storage space and transmission bandwidth. When image compression technology is used for information processing, both a high compression ratio and a fast processing speed are required. Therefore, a key issue in image compression technology is how to improve the efficiency of compression. Today, GPU parallel computing has become easier with the development of CUDA, which defines an API and provides a runtime environment and libraries that make it easy to program recent GPUs. We therefore focus on three image compression algorithms with high complexity and develop parallel design strategies for these algorithms on the GPU. Meanwhile, several CUDA parallelization techniques are also exploited to further boost the performance of these algorithms on the GPU. The main contributions and innovations of the dissertation are as follows:

(1) The main drawback of the error-resilient entropy coding (EREC) algorithm is its high complexity. To overcome this disadvantage, a parallel EREC is implemented on a GPU using NVIDIA CUDA technology. The original EREC only admits fine-grained parallelism within each stage, which brings additional communication overhead. To achieve high parallel efficiency, we propose the partitioning EREC (P-EREC) algorithm, which splits the variable-length blocks into groups and then codes every group with the EREC separately. Each GPU thread processes one group, making the EREC coarse-grained parallel (a simplified kernel sketch is given after item (4)).

(2) In addition, shared memory, texture memory, constant memory, careful thread allocation, and other optimization techniques are fully used to obtain higher performance of the P-EREC on the GPU. In the case that the variable-length data blocks are divided into 128 groups (256 groups, resp.), experimental results show that the parallel P-EREC achieves to (to, resp.) speedup over the original C code of the EREC. Compared with the EREC, the P-EREC not only achieves a good speedup on the GPU but also slightly improves the resilience of the variable-length code (VLC) bit-stream against burst or random errors.

(3) The major shortcoming of the two-dimensional orthogonal matching pursuit (2D-OMP) algorithm is its long computing time for signal recovery. To overcome this disadvantage, a novel parallel design strategy for the 2D-OMP algorithm is developed on a GPU. We first analyze the complexity of the 2D-OMP and point out that its bottlenecks lie in the matrix inverse of the weight-update module and in the projection module. We therefore adopt a matrix inverse update strategy, whose performance is superior to traditional methods, to reduce the complexity of the original matrix inverse; the projection then becomes the most time-consuming module. Hence, a fast parallel matrix-matrix multiplication is leveraged to accelerate the projection on the GPU (see the tiled multiplication sketch after item (4)).

(4) Moreover, a faster matrix-vector multiplication, a parallel reduction algorithm, and some other parallel techniques are also exploited to further boost the performance of the 2D-OMP on the GPU. In the case of the sensing matrix with size of (, resp.) for a scale image, experimental results show that the parallel 2D-OMP achieves to (to, resp.) speedup over the original C code. When the sensing matrix is for a image, the parallel 2D-OMP gains to speedup over the serial code on the CPU.
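A minimal sketch of the coarse-grained mapping described in item (1), assuming one GPU thread per group and a simplified EREC that only tracks bit counts rather than moving the actual bitstream. The group size, offset sequence, and kernel name are illustrative assumptions, not taken from the dissertation.

```cuda
#include <cuda_runtime.h>

#define BLOCKS_PER_GROUP 16   // assumed group size, for illustration only

// One thread codes one group of variable-length blocks with a simplified
// EREC: stage 0 places each block in its own equal-size slot, and stage k
// lets block i offer its surplus bits to slot (i + k) mod N.
__global__ void p_erec_kernel(const int *blockLen, int *placed, int numGroups)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;   // group index
    if (g >= numGroups) return;

    const int *len = blockLen + g * BLOCKS_PER_GROUP;
    int *out = placed + g * BLOCKS_PER_GROUP;

    // Equal slot size, rounded up so the whole group always fits.
    int total = 0;
    for (int i = 0; i < BLOCKS_PER_GROUP; ++i) total += len[i];
    int slot = (total + BLOCKS_PER_GROUP - 1) / BLOCKS_PER_GROUP;

    int remain[BLOCKS_PER_GROUP];   // surplus bits of block i still to place
    int space[BLOCKS_PER_GROUP];    // free bits left in slot i
    for (int i = 0; i < BLOCKS_PER_GROUP; ++i) {
        remain[i] = max(len[i] - slot, 0);
        space[i]  = max(slot - len[i], 0);
        out[i]    = min(len[i], slot);   // bits placed in block i's own slot
    }

    // Remaining EREC stages: search the other slots for free space.
    for (int k = 1; k < BLOCKS_PER_GROUP; ++k) {
        for (int i = 0; i < BLOCKS_PER_GROUP; ++i) {
            if (remain[i] == 0) continue;
            int j = (i + k) % BLOCKS_PER_GROUP;
            int moved = min(remain[i], space[j]);
            remain[i] -= moved;
            space[j]  -= moved;
            out[i]    += moved;          // bookkeeping only, no bit copy
        }
    }
}

// Example launch:
// p_erec_kernel<<<(numGroups + 255) / 256, 256>>>(d_len, d_placed, numGroups);
```

In the real algorithm each thread would also copy coded bits between slots; the sketch only illustrates the one-group-per-thread mapping that makes the EREC coarse-grained.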
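The projection step in item (3) is accelerated with a parallel matrix-matrix multiplication. Below is a standard shared-memory tiled kernel of the kind that could serve this purpose; the tile size, kernel name, and row-major layout are assumptions, and the dissertation's actual kernel may differ.

```cuda
#include <cuda_runtime.h>

#define TILE 16   // assumed tile width

// C = A * B with A (M x K), B (K x N), C (M x N), all row-major.
// Each thread block computes one TILE x TILE tile of C, staging tiles of
// A and B through shared memory to reuse global-memory loads.
__global__ void matmul_tiled(const float *A, const float *B, float *C,
                             int M, int K, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}

// Example launch:
// dim3 block(TILE, TILE);
// dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
// matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, M, K, N);
```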
(5) The joint-bitplane low-density parity-check (LDPC) decoding algorithm suffers from long decoding time. Motivated by the evolution of the GPU and the inherent parallelism of the decoding algorithm, we propose a novel approach for the computationally intensive processing of the joint-bitplane LDPC decoding algorithm on the GPU. Two different parallel models are utilized for belief passing between the different node types of the algorithm: one is the parallel belief propagation (BP) between bit nodes and check nodes, and the other is the parallel BP between symbol nodes and bit nodes.

(6) We find that the bottlenecks of the joint-bitplane LDPC decoding algorithm lie in computing the overall probability mass functions (pmfs) of the symbol nodes and the overall beliefs of the bit nodes. A data partitioning method is therefore applied in both models to split large arrays into small pieces that can be loaded into the L1 cache instead of global memory. Moreover, coalesced memory access and shared memory are exploited to further optimize the performance of the joint-bitplane LDPC decoding algorithm (a reduction sketch follows this item). Overall, experimental results show that when the length-6336 LDPC accumulate (LDPCA) code compresses the source, the joint-bitplane LDPC decoder achieves about speedup on the GPU compared with the original C code on the CPU, while the length-50688 LDPCA code with joint-bitplane LDPC decoding runs about 41 times faster on the GPU than on the CPU.
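A minimal sketch of the shared-memory parallel reduction mentioned in items (4) and (6), written here as if one thread block accumulates the overall belief of one bit node from its incoming messages. The array layout, kernel name, and the use of a plain sum are illustrative assumptions, not the dissertation's actual implementation.

```cuda
#include <cuda_runtime.h>

// One thread block accumulates the overall belief of one bit node by a
// tree reduction over its incoming messages in shared memory.
// messages:      msgsPerNode incoming values per node, stored contiguously
// overallBelief: one output value per bit node
__global__ void accumulate_beliefs(const float *messages, float *overallBelief,
                                   int msgsPerNode)
{
    extern __shared__ float sdata[];
    int node = blockIdx.x;
    int tid  = threadIdx.x;
    const float *msg = messages + (size_t)node * msgsPerNode;

    // Strided load: consecutive threads read consecutive elements (coalesced).
    float v = 0.0f;
    for (int i = tid; i < msgsPerNode; i += blockDim.x)
        v += msg[i];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) overallBelief[node] = sdata[0];
}

// Example launch, assuming a power-of-two block size:
// accumulate_beliefs<<<numBitNodes, 128, 128 * sizeof(float)>>>(d_msgs, d_beliefs, msgsPerNode);
```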
Keywords/Search Tags: Image compression, parallel processing, GPU, CUDA, EREC, 2D-OMP, joint-bitplane LDPC decoding algorithm