Data-parallel Digital Signal Processors: Algorithm Mapping, Architecture Scaling and Workload Adaptation
DSP architectures; stream processors; low power; wireless communications
Emerging applications such as high definition television (HDTV), streaming video, image processing in embedded applications and signal processing in high-speed wireless communications are driving a need for high performance digital signal processors (DSPs) with real-time processing. This class of applications demonstrates significant data parallelism, finite precision, need for power-efficiency and the need for 100's of arithmetic units in the DSP to meet real-time requirements. Data-parallel DSPs meet these requirements by employing clusters of functional units, enabling 100's of computations every clock cycle. These DSPs exploit instruction level parallelism and subword parallelism within clusters, similar to a traditional VLIW (Very Long Instruction Word) DSP, and exploit data parallelism across clusters, similar to vector processors. Stream processors are data-parallel DSPs that use a bandwidth hierarchy to support dataflow to 100's of arithmetic units and are used for evaluating the contributions of this thesis. Different software realizations of the dataflow in the algorithms can affect the performance of stream processors by greater than an order-of-magnitude. The thesis first presents the design of signal processing algorithms that map efficiently on stream processors by parallelizing the algorithms and by re-ordering the flow of data. The design space for stream processors also exhibits trade-offs between arithmetic units per cluster, clusters and the clock frequency to meet the real-time requirements of a given application. This thesis provides a design space exploration tool for stream processors that meets real-time requirements while minimizing power consumption. The presented exploration methodology rapidly searches this design space at compile time to minimize power consumption and selects the number of adders, multipliers, clusters and the real-time clock frequency in the processor. Finally, the thesis improves the power efficiency in the designed stream processor by adapting the compute resources to run-time variations in the workload. The thesis presents an adaptive multiplexer network that allows the number of active clusters to be varied during run-time by turning off unused clusters. Thus, by efficient mapping of algorithms, exploring the architecture design space, and by compute resource adaptation, this thesis improves power efficiency in stream processors and enhances their suitability for high performance, power-aware, signal processing applications.