Research On Optimization Of Trilinear Decomposition Algorithm In Embedded Environment

Posted on:2013-05-28

Degree:Master

Type:Thesis

Country:China

Candidate:K Feng

Full Text:PDF

GTID:2248330395984847

Subject:Computer Science and Technology

Abstract/Summary:

Because it can simultaneously analyze substances of complex composition, the trilinear decomposition algorithm is widely used in a variety of fields. As the algorithm moves toward deployment in embedded applications, low utilization of hardware resources and unsatisfactory performance have become the core problems. Trilinear decomposition is a complicated procedure whose operations are mainly on matrices, so there is an urgent need to work out an optimization strategy for the algorithm on embedded platforms and to improve its performance.

The instruction scheduler and TLB replacement policy of current embedded platforms are simplified compared with those of desktop platforms. In addition, conventional optimizations for embedded platforms were carried out under tight resource constraints, whereas current embedded platforms offer far more resources. However, little comparable work has been done on these platforms. To improve the performance of the trilinear decomposition algorithm on current embedded systems, this thesis carries out optimizations aimed specifically at shortening the algorithm's execution time. The specific work is as follows:

After profiling the trilinear decomposition algorithm and studying the architectural characteristics of the platforms, matrix multiplication is identified as the main target of the overall optimization, and the maximum achievable speedup is calculated as a yardstick for assessing the optimization work.

Based on the characteristics of the ARMv7 architecture, especially the differences in instruction scheduler and TLB replacement policy between ARMv7 and desktop architectures, the blocking algorithm for matrix multiplication in GotoBLAS is optimized to improve baseline performance.

Building on the previous step, the matrix multiplication kernel is optimized to use the vector computation features of NEON. Memory access in the partial-copy step of the blocked matrix multiplication is accelerated by exploiting NEON's memory bandwidth advantage. Matrix multiplication is thus optimized in both arithmetic computation and memory access.

To verify the effectiveness of the optimization, the optimized matrix multiplication is implemented and its performance is assessed on a variety of ARMv7-based platforms. The overall speedup is then measured by replacing the traditional matrix multiplication in the trilinear decomposition algorithm with the optimized one. The experiments show that the optimized implementation outperforms other open-source libraries and achieves a speedup of about 7 to 30 times. After the optimization, the performance of trilinear decomposition improves by a factor of about 2.8.
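
Two ideas in the abstract can be illustrated with a small sketch. The abstract does not say how the maximum speedup was calculated or give the blocking parameters, so everything below is an assumption made for illustration: Amdahl's law is used as one common way to bound the overall speedup when only matrix multiplication is accelerated, and the blocked multiplication copies each block into a contiguous buffer before the inner kernel runs, in the general style of GotoBLAS-like blocking. The names `max_overall_speedup`, `pack_block`, `blocked_matmul`, and the `block` size are hypothetical, and plain Python stands in for the NEON kernel of the thesis.

```python
def max_overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law (assumed here; the abstract does not name its method).

    kernel_fraction: share of total run time spent in matrix multiplication.
    kernel_speedup:  factor by which matrix multiplication alone gets faster.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


def pack_block(M, r0, r1, c0, c1):
    """Copy the sub-block M[r0:r1][c0:c1] into a fresh, compact list of rows."""
    return [row[c0:c1] for row in M[r0:r1]]


def blocked_matmul(A, B, block=2):
    """Compute C = A * B block by block, packing each block before use."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p0 in range(0, k, block):              # slice of the shared dimension
        p1 = min(p0 + block, k)
        for j0 in range(0, n, block):          # column block of B and C
            j1 = min(j0 + block, n)
            Bp = pack_block(B, p0, p1, j0, j1)
            for i0 in range(0, m, block):      # row block of A and C
                i1 = min(i0 + block, m)
                Ap = pack_block(A, i0, i1, p0, p1)
                # Inner kernel: this is the part a NEON version would vectorize.
                for i, a_row in enumerate(Ap):
                    c_row = C[i0 + i]
                    for p, a in enumerate(a_row):
                        for j, b in enumerate(Bp[p]):
                            c_row[j0 + j] += a * b
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(blocked_matmul(A, B, block=2))  # [[58.0, 64.0], [139.0, 154.0]]
    # Hypothetical numbers, not taken from the thesis:
    print(max_overall_speedup(kernel_fraction=0.5, kernel_speedup=2.0))
```

The bound shows why the overall gain of the full algorithm is smaller than the gain of the kernel alone: the part of the run time outside matrix multiplication is not accelerated. In the blocked loop, packing trades an extra copy for contiguous access inside the kernel, which is the step the abstract describes accelerating with NEON.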

Keywords/Search Tags:

Optimization on embedded platform, Trilinear decomposition algorithm, Matrix multiplication, ARMv7

Related items

1	Parallel Algorithms And Architectures For Matrix Computations On FPGA
2	Univariate Polynomial Decomposition And Matrix Multiplication Index Of Improvement In The Limited Domain
3	Research On Key Technology Of Accelerating Floating-Point Matrix Multiplication Based On FPGA In Embedded Environment
4	High Efficient Matrix Operations On Vector-SIMDE DSPs
5	The Algorithm Research Of Low-Rank Matrix Reconstruction For Image Restoration
6	Spark Based Large Scaled Matrix Algorithms
7	Research On Embedded Computing Technology In The Three-way Data Array Analysis Of Complex Samples
8	Optimized Implementation Of Signal Processing Module Based On CUDA
9	The Research Of Matrix Multiplication Efficiency Based On MPI
10	A Study Of The Matrix Operation Harden Implementation On FPGA