Resource Constrained Loop Fusion
Embedded processors have limited on-chip memory. Fusing loops that use the same data can reduce the distance between accesses to the same memory location, avoiding costly off-chip memory transfer. Most existing greedy fusion algorithms solve the unconstrained problem—they do not guard against negative effects of excessive fusion. When a large program contains a great number of loops, unconstrained fusion may generate huge loops that overflow on-chip memory, leading to lower performance. This paper studies the problem for on strained weighted fusion, in which the graph edges carry weights indicating the profitability of fusing the inputs and vertices are annotated with resource requirements. The optimal solution of a constrained weighted fusion problem is a collection of vertex sets such that the total weight associated with pairs of vertices within clusters is maximized and the aggregate resource requirement of every cluster is less than a fixed upper bound R. Finding the optimal solution to a weighted fusion problem (constrained or unconstrained) is P-complete, so we use heuristics. We present two methods. The first picks a group of loops at each fusion step. To ease the resource calculation and fusibility test, the second method picks only a pair of candidate loops at each step. The paper presents the two algorithms, their complexity, and an experimental evaluation.
Technical Report Number