Architectural, Numerical and Implementation Issues in the VLSI Design of an Integrated CORDIC-SVD Processor
Singular Value Decomposition (SVD); Coordinate Rotation Digital Computer (CORDIC); CMOS
The Singular Value Decomposition (SVD) is an important matrix factorization with applications in signal processing, image processing and robotics. This thesis presents some of the issues involved in the design of an array of special-purpose processors connected in a mesh, for fast real-time computation of the SVD. The systolic array implements the Jacobi method for the SVD. This involves plane rotations and inverse tangent calculations and is implemented efficiently in hardware using the Coordinate Rotation Digital Computer (CORDIC) technique. A six-chip custom VLSI chip set for the processor was initially developed and tested. This helped identify several bottlenecks and led to an improved design of the single-chip version. The single-chip implementation incorporates several enhancements that provide greater numerical accuracy. An enhanced architecture which reduces communication was developed within the constraints imposed by VLSI. The chips were fabricated in a 2.0 micron CMOS n-well process using a semicustom design style. The design cycle for future chips can be considerably reduced by adopting a symbolic layout style using high-level VLSI tools such as Octtools from the University of California, Berkeley. Previous architectures for CORDIC processors provided log n bits to guard n bits from truncation errors. A detailed error analysis of the CORDIC iterations indicates that extra guard bits are required to guarantee n bits of precision. In addition, normalization of the input values ensures greater accuracy in the calculations using CORDIC. A novel normalization scheme for CORDIC, which has O(n^1.5) area complexity as opposed to the O(n^2) area it would take if implemented using conventional hardware, is described. Data dependencies in the array allow only a third of the processors to be active at any time. Several timing schemes which improve the utilization by overlapping communication and computation are developed and evaluated.
It is shown that it is possible to effectively utilize all the idle time in the array by concurrently computing the left and right singular vectors along with the singular values. Future versions of the processor can implement these schemes.
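The two CORDIC operations the abstract names, inverse tangent calculation and plane rotation, can be illustrated with a minimal sketch. This is not the thesis's hardware design: it uses floating-point arithmetic instead of fixed-point shift-and-add, and it does not model the guard bits or the normalization scheme discussed above. The function names (`cordic_gain`, `cordic_vectoring`, `cordic_rotation`) and the iteration count are assumptions chosen for illustration.

```python
import math


def cordic_gain(n_iter):
    """Constant scale factor accumulated by n_iter CORDIC micro-rotations."""
    k = 1.0
    for i in range(n_iter):
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k


def cordic_vectoring(x, y, n_iter=32):
    """Vectoring mode: drive y toward 0 and accumulate the angle.

    Returns (magnitude, angle) with angle = atan(y / x); assumes x > 0.
    """
    angle = 0.0
    for i in range(n_iter):
        d = -1.0 if y > 0 else 1.0      # rotate toward the x axis
        shift = 2.0 ** (-i)             # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        angle -= d * math.atan(shift)   # table lookup in hardware
    return x / cordic_gain(n_iter), angle


def cordic_rotation(x, y, theta, n_iter=32):
    """Rotation mode: rotate (x, y) by theta radians.

    Converges for |theta| up to roughly 1.74 rad.
    """
    z = theta
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0     # drive the residual angle to 0
        shift = 2.0 ** (-i)
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)
    k = cordic_gain(n_iter)
    return x / k, y / k


if __name__ == "__main__":
    print(cordic_vectoring(3.0, 4.0))
    print(cordic_rotation(1.0, 0.0, math.pi / 6))
```

In a Jacobi SVD step, vectoring mode would supply the rotation angle and rotation mode would apply it to the matrix entries. Each iteration needs only shifts, additions and a small table of arctangent constants, which is why the technique maps well to hardware.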