Scalability and Data Placement on SGI Origin
Cache-coherent non-uniform memory access (ccNUMA) architectures have attracted considerable academic and industrial interest as a promising direction for large-scale parallel computing. Data placement has been used as a major optimization method on such machines. This study examined scalability and the effect of data placement on a state-of-the-art ccNUMA machine, the SGI Origin, using 16 processors. Three applications from SPLASH-2 were used: FFT, Radix, and Barnes-Hut. The results showed that FFT and Radix cannot scale to 16 processors at 70% efficiency, even for the largest data sizes tested. Barnes-Hut does not scale for small data sizes but scales linearly for large input sizes. The results also showed that data placement makes no difference to performance for any of the three applications. We attribute these results to the advanced uniprocessor used on the Origin (the R10K), the optimizing compiler, and the aggressive communication architecture. Some of our results differ considerably from the predictions of two recent simulation studies of directory-based ccNUMA machines, (Holt:ISCA96) and (Pai:HPCA97), especially for FFT. These differences arise partly because the machine models used in those simulation studies differ from the Origin in some important respects. Our results also include data sizes larger than those in any of the previous simulation studies. To increase our confidence in the latency numbers and data placement tools, we also measured memory latencies using micro-benchmarks.
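As a point of reference for the 70% efficiency threshold, the following is a short worked calculation. It assumes the conventional definition of parallel efficiency, which the abstract does not state explicitly; the symbols T_1 (single-processor run time), T_p (run time on p processors), S and E are introduced here for illustration only.

```latex
% Assumed conventional definitions (not stated in the abstract):
%   T_1 = run time on one processor, T_p = run time on p processors.
\[
  S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}
\]
% Under these definitions, 70% efficiency on 16 processors corresponds to
\[
  E(16) \ge 0.70 \iff S(16) \ge 0.70 \times 16 = 11.2
\]
```

Under this assumption, an application that fails the 70% threshold on 16 processors achieves a speedup below 11.2, while linear scaling corresponds to a speedup close to 16.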