Modern parallel programming models perform best on the particular patterns they are tuned to express and execute, such as fork/join for OpenMP and divide-and-conquer for Cilk. When the model does not fit the problem, shoehorning the problem into the model leads to performance bottlenecks, for example by introducing unnecessary dependences. In addition, some of these models, like MPI, have a performance model that only thinly veils a particular machine's parameters from the problem to be solved.
We postulate that an expressive parallel programming model should not over-constrain the problem it expresses, and should not require the application programmer to code for the underlying machine and sacrifice portability. In our earlier work, we proposed the Data-Driven Tasks model, which provides expressive and portable parallelism by requiring the application programmer to declare only the inherent dependences of the application. In this work, we examine another instantiation of macro-dataflow, the Open Community Runtime (OCR), with work-stealing support for directed acyclic graph (DAG) parallelism.
First, we assess the benefits of these macro-dataflow models over traditional fork/join models using work-stealing, and through DAG parallelism we match the performance of hand-tuned parallel libraries on today's architectures. Second, we address work-stealing granularity optimizations for DAG parallelism, showing how work stealing can be extended to perform better under complex dependence graphs. Lastly, we examine the impact of locality optimizations in work-stealing runtimes for DAG-parallel applications.
On the path to exascale computing, the priority is shifting from minimizing latency to saving energy, as current trends make powering an exascale machine very challenging. The strategy of providing more parallelism to fit power budgets succeeds if applications can express more parallelism and also scale. We argue that macro-dataflow is a framework that allows programmers to declare unconstrained parallelism. We provide an underlying work-stealing runtime that executes this framework with load balance and scalability, and we propose heuristics that extend the default work-stealing approach to perform better with DAG-parallel programs. We present our results on a multi-socket many-core machine and a many-core accelerator to showcase the feasibility of our approach on architectures that indicate what future machines may resemble.