| Public-key cryptography solves the key distribution and management problem that makes symmetric cryptography difficult to deploy, and it is therefore widely used in information security. Through digital signatures and their verification, it protects data integrity during network transmission and provides non-repudiation between the two parties to an online transaction. The most important and widely used public-key algorithms are RSA and ECC (Elliptic Curve Cryptography). The security of RSA rests on the difficulty of factoring large integers, while the security of ECC rests on the difficulty of the elliptic curve discrete logarithm problem. ECC offers higher security per key bit than RSA: for a given security level, the keys and operands involved in ECC computations are normally much shorter than those of RSA. Memory usage, transmission bandwidth, computational complexity and power consumption are therefore greatly reduced. Owing to these advantages, ECC is applied in a wide range of fields, from embedded systems to high-performance servers. Software implementations of public-key algorithms cannot meet real-time requirements in practice and expose the key easily, so hardware coprocessors are now used to implement public-key algorithms. In this dissertation, the RSA and ECC algorithms are studied in depth. The results show that RSA and ECC share the same core operations, such as modular multiplication, modular addition and modular exponentiation, and that the ECC point operations can be built from these basic operations. Based on these results, a high-performance scalable public-key cryptographic coprocessor is proposed to accelerate modular multiplication, modular exponentiation, point addition and subtraction, point doubling, scalar multiplication and related operations. 
Complete cryptographic protocols are implemented by hardware and software working together. The core of the coprocessor is an array of modular arithmetic units for parallel computing; each modular arithmetic unit consists of a high-performance scalable Montgomery modular multiplication unit and a high-performance scalable modular addition and subtraction unit. The Montgomery modular multiplication unit is based on the proposed dual-field unified high-radix Montgomery modular multiplication algorithm, which processes operands one word at a time. Its kernel is a pipeline of multiple processing elements, and it supports dual-field modular multiplication for operands of any width. The modular addition and subtraction unit also processes operands word by word, which avoids the modular reduction used in traditional modular addition and subtraction circuits. It is further optimized for ECC operations and simplifies addition and subtraction in ECC. This unit supports modular addition and subtraction for operands of any width. The coprocessor has a strong parallel computing capability: it supports the parallel binary algorithm and the Chinese Remainder Theorem for modular exponentiation, and it supports the scheduling of ECC point operations for parallel execution. As a result, the operations in RSA and ECC are accelerated effectively. Several design parameters of the coprocessor, such as the data path width, the number of processing elements in the Montgomery modular multiplication unit and the number of modular arithmetic units, must be optimized when a concrete coprocessor is designed. The choice of these parameters trades off area against performance for a variety of applications. Using a 0.18 μm CMOS process, coprocessors with different parameter sets are designed and evaluated, and a method for choosing the best parameters is proposed. 
Finally, a coprocessor chip is implemented with one optimized set of parameters. The chip runs at a maximum frequency of 250 MHz and occupies an area of 380k gates. Measurements of the chip show that the coprocessor accelerates RSA and ECC computation effectively: it performs one 1024-bit modular exponentiation in 232 μs using the Chinese Remainder Theorem, one 192-bit scalar multiplication in 242 μs over the prime field, and one 192-bit scalar multiplication in 222 μs over the binary field. |
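
To make the arithmetic named in the abstract concrete, the following is a minimal software sketch of word-serial Montgomery multiplication and of modular exponentiation split with the Chinese Remainder Theorem. It is an illustration only and is not the dissertation's hardware algorithm: it covers the prime field only (no binary-field path, no pipelining), the function names `montgomery_multiply`, `montgomery_pow` and `rsa_crt_pow` are hypothetical, and the default 32-bit word size is an assumption. It needs Python 3.8 or later for `pow(x, -1, m)`.

```python
def montgomery_multiply(a, b, n, word_bits=32):
    """Return a * b * R^-1 mod n, with R = 2^(word_bits * num_words).

    Assumes n is odd and 0 <= a, b < n. One word of `a` is consumed per
    iteration, mirroring the word-based processing named in the abstract.
    """
    mask = (1 << word_bits) - 1
    num_words = -(-n.bit_length() // word_bits)      # ceiling division
    n_prime = (-pow(n, -1, 1 << word_bits)) & mask   # -n^-1 mod 2^word_bits
    t = 0
    for i in range(num_words):
        a_i = (a >> (i * word_bits)) & mask          # i-th word of a
        t += a_i * b
        m = ((t & mask) * n_prime) & mask            # makes t + m*n divisible by 2^w
        t = (t + m * n) >> word_bits                 # exact division by 2^w
    return t - n if t >= n else t                    # single final subtraction


def montgomery_pow(base, exp, n, word_bits=32):
    """Left-to-right binary exponentiation in the Montgomery domain."""
    num_words = -(-n.bit_length() // word_bits)
    r = 1 << (word_bits * num_words)
    x = montgomery_multiply(base % n, (r * r) % n, n, word_bits)  # base * R mod n
    acc = r % n                                      # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = montgomery_multiply(acc, acc, n, word_bits)
        if bit == "1":
            acc = montgomery_multiply(acc, x, n, word_bits)
    return montgomery_multiply(acc, 1, n, word_bits)  # leave the Montgomery domain


def rsa_crt_pow(c, d, p, q, word_bits=32):
    """Compute c^d mod (p*q) from two half-size exponentiations (Garner's form).

    Assumes p and q are distinct odd primes and c is coprime to p*q.
    """
    m_p = montgomery_pow(c % p, d % (p - 1), p, word_bits)
    m_q = montgomery_pow(c % q, d % (q - 1), q, word_bits)
    h = (pow(q, -1, p) * (m_p - m_q)) % p
    return m_q + h * q
```

The two half-size exponentiations in `rsa_crt_pow` do not depend on each other, which is the kind of independence an array of modular arithmetic units could exploit by running them in parallel; here they simply run one after the other.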